The NFL is a multi-billion dollar business. Millions of fans around the world, possibly including yourself, invest significant time and money to follow and watch their favorite teams and players. Teams work to fulfill their duty to those fans: with the revenue they take in, they try to provide the best experience possible, whether through stadium renovations that improve game day or by building a winning team that makes fans want to tune in. Every year, each team gets around 225 million dollars to spend on players, with the money paid out through contracts. It sounds straightforward: teams spend money on players they think will benefit the team, and all is good, right? Obviously the answer is no, and there is a lot of nuance and subtlety in deciding how much to pay players. Time and time again, teams give far too much money to certain players, which ends up hurting the team, or avoid signing a player who ends up being a bargain for someone else.

Then there is the question of how to weigh which positions are worth more money. Each position has a vastly different role on a team, so some are paid better than others, and this also depends on the team's scheme, which can change how much each position is valued. Many more factors, such as age and past performance, have to be accounted for as well. And even a signing done perfectly by the book can still fail: a player could be lazy or could have had a fluke year, both of which are very hard to account for. So which factors can you account for? This is the question we will answer today, by looking at both basic box-score statistics and Pro Football Focus' (PFF) grading system, which breaks down the tape for every play of every player.
This brings an element of context to the stats, which can otherwise be deceiving based on what happens in a play; for example, a quarterback can throw a perfect pass that gets dropped, which is still recorded as an incompletion. We will analyze stats of all types, basic and advanced, to determine which ones correlate best with salary and how strongly each one does. This will give us a better picture of how to predict what teams should pay a player based on past performance, and how much value a player should add to a team in general. We will break the analysis down by category: passing, rushing, receiving, blocking, run defense, pass rush, and coverage. This gives us a clean split across positions and lets us differentiate the value of specific skills.
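Before diving in, here is a minimal sketch of the kind of measurement we will be making throughout: the Pearson correlation between a stat column and a salary column. The numbers below are made up for illustration, not real NFL data.

```python
import pandas as pd

# Toy data: a stat column and a salary column (hypothetical values).
toy = pd.DataFrame({
    'passing_yards': [3000, 3500, 4200, 4800, 5200],
    'aav_millions':  [5, 12, 25, 38, 45],
})

# Series.corr computes the Pearson correlation coefficient by default:
# values near 1 mean the stat and salary rise together.
r = toy['passing_yards'].corr(toy['aav_millions'])
print(round(r, 3))  # strongly positive for this toy data
```

The same one-liner scales to the real dataframes once they are built, which is exactly how we will rank stats against salary later on.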
import pandas as pd
import numpy as np
import warnings
# filter out warnings
warnings.filterwarnings('ignore')
# get salary data
tables = pd.read_html('https://www.spotrac.com/nfl/contracts/sort-value/limit-2000/', flavor='html5lib')
First we must collect data. To do this we used the Python library Pandas to read the HTML. Since the website we are collecting from stores the data we're looking for in a table tag, we can easily grab it and store it in a dataframe. This website contains each player's salary information, including when they signed, total contract value, average annual value (AAV), and details about guaranteed money.
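As a minimal illustration of how `pd.read_html` behaves (using a toy HTML snippet rather than the Spotrac page): every table element found in the document becomes one DataFrame in the returned list, with header rows inferred automatically.

```python
from io import StringIO
import pandas as pd

# A tiny HTML document with one table (hypothetical players, not Spotrac data).
html = """
<table>
  <tr><th>Player</th><th>Value</th></tr>
  <tr><td>A. Example</td><td>$1,000,000</td></tr>
  <tr><td>B. Sample</td><td>$2,500,000</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> in the document.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df.shape)  # 2 data rows, 2 columns; the <th> row became the header
```

This is why the scraping cell above indexes `tables[0]`: the contracts page has one main table, and it comes back as the first element of the list.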
import re
salary_df = tables[0].rename(columns={'Player': 'player'})
players = salary_df['player']
pl = []
start_years = []
end_years = []
for player in players:
    try:
        # split up the player column and extract name, start year, and end year
        groups = re.search(r'(.*) .* \| (\d{4})-(\d{4}) \(FA: (\d{4})\)', player)
        pl.append(groups.groups()[0])
        start_years.append(groups.groups()[1])
        end_years.append(groups.groups()[2])
    # if the regex fails to match (or the cell isn't a string),
    # record missing values instead of raising
    except (AttributeError, TypeError):
        pl.append(None)
        start_years.append(None)
        end_years.append(None)
# insert clean data
salary_df['player'] = pl
salary_df['start_year'] = start_years
salary_df['end_year'] = end_years
# turn salary totals into ints (e.g. '$1,000,000' -> 1000000)
def convert_to_int(v):
    try:
        return int(v.replace('$', '').replace(',', ''))
    # non-string or non-numeric values become 0
    except (ValueError, AttributeError):
        return 0
salary_df['Value'] = salary_df['Value'].apply(convert_to_int)
salary_df['AAV'] = salary_df['AAV'].apply(convert_to_int)
salary_df['Sign Bonus'] = salary_df['Sign Bonus'].apply(convert_to_int)
salary_df["G'teed @ Sign"] = salary_df["G'teed @ Sign"].apply(convert_to_int)
salary_df["Practical G'teed"] = salary_df["Practical G'teed"].apply(convert_to_int)
salary_df['start_year'] = salary_df['start_year'].apply(convert_to_int)
salary_df['end_year'] = salary_df['end_year'].apply(convert_to_int)
# fillna returns a new frame, so reassign to keep the filled values
salary_df = salary_df.fillna(0)
salary_df
| Rank | player | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Patrick Mahomes | 24 | 10 | 450000000 | 45000000 | 10000000 | 63081905 | 141481905 | 2020 | 2031 |
| 1 | 2 | Lamar Jackson | 26 | 5 | 260000000 | 52000000 | 72500000 | 135000000 | 185000000 | 2023 | 2027 |
| 2 | 3 | Josh Allen | 25 | 6 | 258034000 | 43005667 | 16500000 | 100038596 | 150000000 | 2021 | 2028 |
| 3 | 4 | Jalen Hurts | 24 | 5 | 255000000 | 51000000 | 23294000 | 110000000 | 179399000 | 2023 | 2028 |
| 4 | 5 | Russell Wilson | 33 | 5 | 242588236 | 48517647 | 50000000 | 124000000 | 161000000 | 2022 | 2028 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 1995 | Joshua Miles | 27 | 1 | 1162500 | 1162500 | 152500 | 352500 | 352500 | 2023 | 2023 |
| 1996 | 1997 | Joe Bachie | 25 | 1 | 1160000 | 1160000 | 75000 | 75000 | 75000 | 2023 | 2023 |
| 1997 | 1998 | Jordan Willis | 27 | 1 | 1156250 | 1156250 | 0 | 0 | 0 | 2023 | 2023 |
| 1998 | 1999 | Tim Boyle | 28 | 1 | 1155000 | 1155000 | 0 | 75000 | 75000 | 2023 | 2023 |
| 1999 | 1999 | Marcus Kemp | 27 | 1 | 1155000 | 1155000 | 50000 | 50000 | 50000 | 2023 | 2023 |
2000 rows × 11 columns
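The regular expression used above can be sanity-checked against a hypothetical cell value in the format it expects, roughly "name, extra text, pipe, contract years, free-agency year". Note that the fourth captured group (the FA year) is matched but never stored by the cleaning loop.

```python
import re

# The same pattern used in the cleaning loop above.
pattern = r'(.*) .* \| (\d{4})-(\d{4}) \(FA: (\d{4})\)'

# A hypothetical cell in the expected format (not an actual scraped value).
sample = 'Patrick Mahomes QB | 2020-2031 (FA: 2032)'

m = re.search(pattern, sample)
# groups(): (name, start year, end year, free-agency year)
print(m.groups())  # ('Patrick Mahomes', '2020', '2031', '2032')
```

If a cell deviates from this shape, `re.search` returns `None`, which is why the loop catches the resulting `AttributeError` and records missing values instead.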
Now we need to get the player stats. To do this we exported CSV files from pff.com, which contain many useful statistics for each position. Again we store these in pandas dataframes to stay consistent, since we will use all of the datasets together. We also need to clean the data: we concatenate the yearly files, drop any seasons in which a player took very few snaps, and keep only the positions we want. We keep the years distinct because recency matters: a player having a good season five years ago is not nearly the same as them playing well last season.
# passing players and stats: one CSV per season,
# 2022 has no suffix, then ' (1)' = 2021 down to ' (5)' = 2017
passing_dfs = []
for i, year in enumerate(range(2022, 2016, -1)):
    suffix = '' if i == 0 else f' ({i})'
    season = pd.read_csv(f'./passing/passing_summary{suffix}.csv')
    season['year'] = year
    passing_dfs.append(season)
# create dataframe
passing_df = pd.concat(passing_dfs)
# remove outliers
passing_df = passing_df[passing_df['position'] == 'QB']
passing_df = passing_df[passing_df['attempts'] > 100]
# merge players with salary
passing_df = pd.merge(passing_df, salary_df, on='player')
passing_df
| player | player_id | position | team_name | player_game_count | accuracy_percent | aimed_passes | attempts | avg_depth_of_target | avg_time_to_throw | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Patrick Mahomes | 11765 | QB | KC | 20 | 78.0 | 685 | 747 | 7.5 | 2.85 | ... | 1 | 24 | 10 | 450000000 | 45000000 | 10000000 | 63081905 | 141481905 | 2020 | 2031 |
| 1 | Patrick Mahomes | 11765 | QB | KC | 20 | 79.2 | 708 | 780 | 7.4 | 2.87 | ... | 1 | 24 | 10 | 450000000 | 45000000 | 10000000 | 63081905 | 141481905 | 2020 | 2031 |
| 2 | Patrick Mahomes | 11765 | QB | KC | 18 | 77.1 | 656 | 705 | 8.5 | 2.91 | ... | 1 | 24 | 10 | 450000000 | 45000000 | 10000000 | 63081905 | 141481905 | 2020 | 2031 |
| 3 | Patrick Mahomes | 11765 | QB | KC | 17 | 77.2 | 549 | 596 | 8.5 | 2.84 | ... | 1 | 24 | 10 | 450000000 | 45000000 | 10000000 | 63081905 | 141481905 | 2020 | 2031 |
| 4 | Patrick Mahomes | 11765 | QB | KC | 18 | 77.8 | 595 | 652 | 9.6 | 2.80 | ... | 1 | 24 | 10 | 450000000 | 45000000 | 10000000 | 63081905 | 141481905 | 2020 | 2031 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 181 | Case Keenum | 7323 | QB | MIN | 17 | 75.1 | 527 | 569 | 7.9 | 2.72 | ... | 634 | 35 | 2 | 6250000 | 3125000 | 750000 | 4000000 | 4000000 | 2023 | 2024 |
| 182 | David Blough | 46508 | QB | DET | 5 | 68.6 | 159 | 174 | 8.9 | 2.89 | ... | 1895 | 27 | 1 | 1232500 | 1232500 | 76250 | 76250 | 76250 | 2023 | 2023 |
| 183 | Blaine Gabbert | 6162 | QB | TEN | 6 | 66.3 | 98 | 101 | 7.7 | 2.29 | ... | 1832 | 33 | 1 | 1317500 | 1317500 | 152500 | 1092500 | 1092500 | 2023 | 2023 |
| 184 | Blaine Gabbert | 6162 | QB | ARZ | 5 | 65.8 | 161 | 171 | 9.8 | 2.58 | ... | 1832 | 33 | 1 | 1317500 | 1317500 | 152500 | 1092500 | 1092500 | 2023 | 2023 |
| 185 | Brian Hoyer | 5277 | QB | NE | 8 | 68.8 | 202 | 211 | 8.4 | 2.39 | ... | 836 | 37 | 2 | 4500000 | 2250000 | 875000 | 4205000 | 4205000 | 2023 | 2024 |
186 rows × 53 columns
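A side note on the merge step above: `pd.merge` performs an inner join by default, which has two consequences that shape the row counts in these tables. A toy sketch with hypothetical players shows both: stat seasons with no matching salary row are dropped, and one salary row is repeated for each of a player's seasons.

```python
import pandas as pd

# Hypothetical stat seasons: two for one player, one each for two others.
stats = pd.DataFrame({
    'player': ['A. Example', 'A. Example', 'B. Sample', 'C. Unsigned'],
    'year':   [2022, 2021, 2022, 2022],
})

# Hypothetical salary table with no row for C. Unsigned.
salary = pd.DataFrame({
    'player': ['A. Example', 'B. Sample'],
    'AAV':    [10_000_000, 2_000_000],
})

# Inner join on 'player': C. Unsigned drops out,
# and A. Example's salary row repeats for each of his two seasons.
merged = pd.merge(stats, salary, on='player')
print(len(merged))  # 3
```

This is exactly why Patrick Mahomes appears once per season above, each time carrying the same contract columns.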
Here are the rushing stats; we will limit these to just running backs.
# rushing players and stats: one CSV per season,
# 2022 has no suffix, then ' (1)' = 2021 down to ' (5)' = 2017
rushing_dfs = []
for i, year in enumerate(range(2022, 2016, -1)):
    suffix = '' if i == 0 else f' ({i})'
    season = pd.read_csv(f'./rushing/rushing_summary{suffix}.csv')
    season['year'] = year
    rushing_dfs.append(season)
# create dataframe
rushing_df = pd.concat(rushing_dfs)
# remove outliers
rushing_df = rushing_df[rushing_df['position'] == 'HB']
rushing_df = rushing_df[rushing_df['attempts'] > 50]
# merge players with salary
rushing_df = pd.merge(rushing_df, salary_df, on='player')
rushing_df
| player | player_id | position | team_name | player_game_count | attempts | avoided_tackles | breakaway_attempts | breakaway_percent | breakaway_yards | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Derrick Henry | 10679 | HB | TEN | 16 | 349 | 69 | 15 | 27.6 | 424 | ... | 142 | 26 | 4 | 50000000 | 12500000 | 12000000 | 25500000 | 25500000 | 2020 | 2023 |
| 1 | Derrick Henry | 10679 | HB | TEN | 9 | 239 | 36 | 9 | 26.4 | 264 | ... | 142 | 26 | 4 | 50000000 | 12500000 | 12000000 | 25500000 | 25500000 | 2020 | 2023 |
| 2 | Derrick Henry | 10679 | HB | TEN | 17 | 396 | 77 | 21 | 31.2 | 645 | ... | 142 | 26 | 4 | 50000000 | 12500000 | 12000000 | 25500000 | 25500000 | 2020 | 2023 |
| 3 | Derrick Henry | 10679 | HB | TEN | 18 | 386 | 71 | 23 | 33.5 | 665 | ... | 142 | 26 | 4 | 50000000 | 12500000 | 12000000 | 25500000 | 25500000 | 2020 | 2023 |
| 4 | Derrick Henry | 10679 | HB | TEN | 16 | 215 | 45 | 9 | 27.6 | 292 | ... | 142 | 26 | 4 | 50000000 | 12500000 | 12000000 | 25500000 | 25500000 | 2020 | 2023 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 208 | Ameer Abdullah | 9487 | HB | DET | 14 | 165 | 20 | 7 | 31.9 | 176 | ... | 1749 | 29 | 1 | 1750000 | 1750000 | 200000 | 500000 | 500000 | 2023 | 2023 |
| 209 | Benny Snell Jr. | 45954 | HB | PIT | 16 | 113 | 14 | 3 | 21.0 | 80 | ... | 1275 | 21 | 4 | 3222068 | 805517 | 702068 | 702068 | 702068 | 2019 | 2022 |
| 210 | Benny Snell Jr. | 45954 | HB | PIT | 11 | 108 | 20 | 5 | 23.0 | 98 | ... | 1275 | 21 | 4 | 3222068 | 805517 | 702068 | 702068 | 702068 | 2019 | 2022 |
| 211 | Justice Hill | 45865 | HB | BLT | 17 | 58 | 8 | 1 | 8.0 | 18 | ... | 835 | 25 | 2 | 4510000 | 2255000 | 1000000 | 1250000 | 1250000 | 2023 | 2024 |
| 212 | Ty Montgomery | 9527 | HB | GB | 8 | 71 | 4 | 1 | 13.6 | 37 | ... | 1210 | 29 | 2 | 3600000 | 1800000 | 300000 | 300000 | 300000 | 2022 | 2023 |
213 rows × 58 columns
Here are the receiving stats. These are a little more complicated, since they contain primary statistics for both tight ends and wide receivers. We will use the data from these files for two separate dataframes: one for the wide receivers and one for the tight ends.
# receiving players and stats: one CSV per season,
# 2022 has no suffix, then ' (1)' = 2021 down to ' (5)' = 2017
receiving_dfs = []
for i, year in enumerate(range(2022, 2016, -1)):
    suffix = '' if i == 0 else f' ({i})'
    season = pd.read_csv(f'./receiving/receiving_summary{suffix}.csv')
    season['year'] = year
    receiving_dfs.append(season)
# create dataframe
receiving_df = pd.concat(receiving_dfs)
# remove outliers
receiving_df = receiving_df[receiving_df['targets'] > 50]
# merge players with salary
receiving_df = pd.merge(receiving_df, salary_df, on='player')
receiving_df
| player | player_id | position | team_name | player_game_count | avg_depth_of_target | avoided_tackles | caught_percent | contested_catch_rate | contested_receptions | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Justin Jefferson | 61398 | WR | MIN | 18 | 10.7 | 11 | 73.0 | 55.0 | 22 | ... | 422 | 21 | 4 | 13122805 | 3280701 | 7103856 | 13122805 | 13122805 | 2020 | 2024 |
| 1 | Justin Jefferson | 61398 | WR | MIN | 17 | 13.3 | 9 | 66.3 | 48.1 | 13 | ... | 422 | 21 | 4 | 13122805 | 3280701 | 7103856 | 13122805 | 13122805 | 2020 | 2024 |
| 2 | Justin Jefferson | 61398 | WR | MIN | 16 | 12.0 | 13 | 72.7 | 54.5 | 12 | ... | 422 | 21 | 4 | 13122805 | 3280701 | 7103856 | 13122805 | 13122805 | 2020 | 2024 |
| 3 | Travis Kelce | 7844 | TE | KC | 20 | 7.5 | 25 | 75.3 | 45.5 | 10 | ... | 115 | 30 | 4 | 57250000 | 14312500 | 0 | 20750000 | 22750000 | 2020 | 2025 |
| 4 | Travis Kelce | 7844 | TE | KC | 19 | 7.6 | 23 | 72.8 | 60.0 | 9 | ... | 115 | 30 | 4 | 57250000 | 14312500 | 0 | 20750000 | 22750000 | 2020 | 2025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 492 | Jordan Akins | 47050 | TE | HST | 16 | 7.4 | 8 | 67.9 | 25.0 | 2 | ... | 1074 | 30 | 2 | 3900000 | 1950000 | 630000 | 1730000 | 1730000 | 2023 | 2024 |
| 493 | O.J. Howard | 11774 | TE | TB | 14 | 10.1 | 2 | 65.4 | 41.7 | 5 | ... | 1895 | 28 | 1 | 1232500 | 1232500 | 76250 | 451250 | 451250 | 2023 | 2023 |
| 494 | Antonio Callaway | 48101 | WR | CLV | 16 | 14.5 | 3 | 55.1 | 38.5 | 5 | ... | 1683 | 25 | 2 | 2135000 | 1067500 | 0 | 0 | 0 | 2023 | 2024 |
| 495 | Marquise Goodwin | 7859 | WR | SF | 16 | 16.0 | 6 | 56.6 | 50.0 | 11 | ... | 1756 | 32 | 1 | 1700000 | 1700000 | 250000 | 400000 | 400000 | 2023 | 2023 |
| 496 | Trent Taylor | 11932 | WR | SF | 15 | 6.9 | 4 | 72.9 | 66.7 | 8 | ... | 1895 | 28 | 1 | 1232500 | 1232500 | 52500 | 52500 | 52500 | 2023 | 2023 |
497 rows × 56 columns
# get wide receiver data
wr_df = receiving_df[receiving_df['position'] == 'WR']
wr_df
| player | player_id | position | team_name | player_game_count | avg_depth_of_target | avoided_tackles | caught_percent | contested_catch_rate | contested_receptions | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Justin Jefferson | 61398 | WR | MIN | 18 | 10.7 | 11 | 73.0 | 55.0 | 22 | ... | 422 | 21 | 4 | 13122805 | 3280701 | 7103856 | 13122805 | 13122805 | 2020 | 2024 |
| 1 | Justin Jefferson | 61398 | WR | MIN | 17 | 13.3 | 9 | 66.3 | 48.1 | 13 | ... | 422 | 21 | 4 | 13122805 | 3280701 | 7103856 | 13122805 | 13122805 | 2020 | 2024 |
| 2 | Justin Jefferson | 61398 | WR | MIN | 16 | 12.0 | 13 | 72.7 | 54.5 | 12 | ... | 422 | 21 | 4 | 13122805 | 3280701 | 7103856 | 13122805 | 13122805 | 2020 | 2024 |
| 9 | Tyreek Hill | 10799 | WR | MIA | 18 | 12.6 | 12 | 69.6 | 50.0 | 13 | ... | 19 | 28 | 4 | 120000000 | 30000000 | 25500000 | 52535000 | 72200000 | 2022 | 2026 |
| 10 | Tyreek Hill | 10799 | WR | KC | 20 | 10.6 | 15 | 73.6 | 28.6 | 6 | ... | 19 | 28 | 4 | 120000000 | 30000000 | 25500000 | 52535000 | 72200000 | 2022 | 2026 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 486 | Randall Cobb | 6216 | WR | GB | 15 | 6.1 | 5 | 77.6 | 38.5 | 5 | ... | 1279 | 32 | 1 | 3000000 | 3000000 | 250000 | 250000 | 250000 | 2023 | 2023 |
| 488 | Phillip Dorsett | 9462 | WR | NE | 15 | 15.5 | 0 | 50.0 | 36.4 | 4 | ... | 1954 | 30 | 1 | 1217500 | 1217500 | 0 | 0 | 0 | 2023 | 2023 |
| 494 | Antonio Callaway | 48101 | WR | CLV | 16 | 14.5 | 3 | 55.1 | 38.5 | 5 | ... | 1683 | 25 | 2 | 2135000 | 1067500 | 0 | 0 | 0 | 2023 | 2024 |
| 495 | Marquise Goodwin | 7859 | WR | SF | 16 | 16.0 | 6 | 56.6 | 50.0 | 11 | ... | 1756 | 32 | 1 | 1700000 | 1700000 | 250000 | 400000 | 400000 | 2023 | 2023 |
| 496 | Trent Taylor | 11932 | WR | SF | 15 | 6.9 | 4 | 72.9 | 66.7 | 8 | ... | 1895 | 28 | 1 | 1232500 | 1232500 | 52500 | 52500 | 52500 | 2023 | 2023 |
331 rows × 56 columns
# tight end data
te_df = receiving_df[receiving_df['position'] == 'TE']
te_df
| player | player_id | position | team_name | player_game_count | avg_depth_of_target | avoided_tackles | caught_percent | contested_catch_rate | contested_receptions | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Travis Kelce | 7844 | TE | KC | 20 | 7.5 | 25 | 75.3 | 45.5 | 10 | ... | 115 | 30 | 4 | 57250000 | 14312500 | 0 | 20750000 | 22750000 | 2020 | 2025 |
| 4 | Travis Kelce | 7844 | TE | KC | 19 | 7.6 | 23 | 72.8 | 60.0 | 9 | ... | 115 | 30 | 4 | 57250000 | 14312500 | 0 | 20750000 | 22750000 | 2020 | 2025 |
| 5 | Travis Kelce | 7844 | TE | KC | 18 | 8.9 | 16 | 76.4 | 30.4 | 7 | ... | 115 | 30 | 4 | 57250000 | 14312500 | 0 | 20750000 | 22750000 | 2020 | 2025 |
| 6 | Travis Kelce | 7844 | TE | KC | 19 | 9.1 | 18 | 76.3 | 53.8 | 14 | ... | 115 | 30 | 4 | 57250000 | 14312500 | 0 | 20750000 | 22750000 | 2020 | 2025 |
| 7 | Travis Kelce | 7844 | TE | KC | 18 | 9.3 | 11 | 73.4 | 64.0 | 16 | ... | 115 | 30 | 4 | 57250000 | 14312500 | 0 | 20750000 | 22750000 | 2020 | 2025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 489 | Kyle Rudolph | 6195 | TE | MIN | 17 | 7.3 | 5 | 81.8 | 80.0 | 4 | ... | 1695 | 32 | 1 | 2000000 | 2000000 | 500000 | 2000000 | 2000000 | 2022 | 2022 |
| 490 | Kyle Rudolph | 6195 | TE | MIN | 16 | 7.3 | 4 | 84.2 | 53.8 | 7 | ... | 1695 | 32 | 1 | 2000000 | 2000000 | 500000 | 2000000 | 2000000 | 2022 | 2022 |
| 491 | Kyle Rudolph | 6195 | TE | MIN | 18 | 8.2 | 3 | 72.4 | 38.9 | 7 | ... | 1695 | 32 | 1 | 2000000 | 2000000 | 500000 | 2000000 | 2000000 | 2022 | 2022 |
| 492 | Jordan Akins | 47050 | TE | HST | 16 | 7.4 | 8 | 67.9 | 25.0 | 2 | ... | 1074 | 30 | 2 | 3900000 | 1950000 | 630000 | 1730000 | 1730000 | 2023 | 2024 |
| 493 | O.J. Howard | 11774 | TE | TB | 14 | 10.1 | 2 | 65.4 | 41.7 | 5 | ... | 1895 | 28 | 1 | 1232500 | 1232500 | 76250 | 451250 | 451250 | 2023 | 2023 |
100 rows × 56 columns
Here are the blocking stats. These contain primary stats for offensive linemen, who are typically broken down into three positions: center, guard, and tackle. We will create three separate dataframes for this reason.
# offense_blocking players and stats: one CSV per season,
# 2022 has no suffix, then ' (1)' = 2021 down to ' (5)' = 2017
offense_blocking_dfs = []
for i, year in enumerate(range(2022, 2016, -1)):
    suffix = '' if i == 0 else f' ({i})'
    season = pd.read_csv(f'./offense_blocking/offense_blocking{suffix}.csv')
    season['year'] = year
    offense_blocking_dfs.append(season)
# create dataframe
offense_blocking_df = pd.concat(offense_blocking_dfs)
# remove outliers
offense_blocking_df = offense_blocking_df[offense_blocking_df['snap_counts_block'] > 100]
# merge players with salary
offense_blocking_df = pd.merge(offense_blocking_df, salary_df, on='player')
offense_blocking_df
| player | player_id | position | team_name | player_game_count | block_percent | declined_penalties | franchise_id | grades_offense | grades_pass_block | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jason Kelce | 6343 | C | PHI | 20 | 100.0 | 1 | 24 | 89.4 | 83.3 | ... | 389 | 35 | 1 | 14250000 | 14250000 | 10335000 | 14250000 | 14250000 | 2023 | 2023 |
| 1 | Jason Kelce | 6343 | C | PHI | 18 | 100.0 | 0 | 24 | 84.5 | 69.2 | ... | 389 | 35 | 1 | 14250000 | 14250000 | 10335000 | 14250000 | 14250000 | 2023 | 2023 |
| 2 | Jason Kelce | 6343 | C | PHI | 16 | 100.0 | 0 | 24 | 69.6 | 62.0 | ... | 389 | 35 | 1 | 14250000 | 14250000 | 10335000 | 14250000 | 14250000 | 2023 | 2023 |
| 3 | Jason Kelce | 6343 | C | PHI | 17 | 100.0 | 1 | 24 | 81.1 | 67.6 | ... | 389 | 35 | 1 | 14250000 | 14250000 | 10335000 | 14250000 | 14250000 | 2023 | 2023 |
| 4 | Jason Kelce | 6343 | C | PHI | 18 | 100.0 | 0 | 24 | 87.1 | 88.0 | ... | 389 | 35 | 1 | 14250000 | 14250000 | 10335000 | 14250000 | 14250000 | 2023 | 2023 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1750 | Brian Hoyer | 5277 | QB | NE | 12 | 34.5 | 0 | 19 | 63.6 | NaN | ... | 836 | 37 | 2 | 4500000 | 2250000 | 875000 | 4205000 | 4205000 | 2023 | 2024 |
| 1751 | Cordarrelle Patterson | 7810 | WR | OAK | 16 | 29.6 | 0 | 23 | 65.5 | 62.4 | ... | 486 | 31 | 2 | 10500000 | 5250000 | 5000000 | 5000000 | 5000000 | 2022 | 2023 |
| 1752 | Blaine Gabbert | 6162 | QB | ARZ | 5 | 33.3 | 0 | 1 | 45.4 | NaN | ... | 1832 | 33 | 1 | 1317500 | 1317500 | 152500 | 1092500 | 1092500 | 2023 | 2023 |
| 1753 | Trent Taylor | 11932 | WR | SF | 15 | 22.5 | 0 | 28 | 68.4 | 69.6 | ... | 1895 | 28 | 1 | 1232500 | 1232500 | 52500 | 52500 | 52500 | 2023 | 2023 |
| 1754 | Derrick Henry | 10679 | HB | TEN | 18 | 20.5 | 0 | 31 | 73.0 | 76.5 | ... | 142 | 26 | 4 | 50000000 | 12500000 | 12000000 | 25500000 | 25500000 | 2020 | 2023 |
1755 rows × 42 columns
# center data
c_df = offense_blocking_df[offense_blocking_df['position'] == 'C']
c_df
| player | player_id | position | team_name | player_game_count | block_percent | declined_penalties | franchise_id | grades_offense | grades_pass_block | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jason Kelce | 6343 | C | PHI | 20 | 100.0 | 1 | 24 | 89.4 | 83.3 | ... | 389 | 35 | 1 | 14250000 | 14250000 | 10335000 | 14250000 | 14250000 | 2023 | 2023 |
| 1 | Jason Kelce | 6343 | C | PHI | 18 | 100.0 | 0 | 24 | 84.5 | 69.2 | ... | 389 | 35 | 1 | 14250000 | 14250000 | 10335000 | 14250000 | 14250000 | 2023 | 2023 |
| 2 | Jason Kelce | 6343 | C | PHI | 16 | 100.0 | 0 | 24 | 69.6 | 62.0 | ... | 389 | 35 | 1 | 14250000 | 14250000 | 10335000 | 14250000 | 14250000 | 2023 | 2023 |
| 3 | Jason Kelce | 6343 | C | PHI | 17 | 100.0 | 1 | 24 | 81.1 | 67.6 | ... | 389 | 35 | 1 | 14250000 | 14250000 | 10335000 | 14250000 | 14250000 | 2023 | 2023 |
| 4 | Jason Kelce | 6343 | C | PHI | 18 | 100.0 | 0 | 24 | 87.1 | 88.0 | ... | 389 | 35 | 1 | 14250000 | 14250000 | 10335000 | 14250000 | 14250000 | 2023 | 2023 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1564 | Michael Deiter | 38554 | C | MIA | 8 | 100.0 | 1 | 17 | 60.6 | 57.2 | ... | 1895 | 26 | 1 | 1232500 | 1232500 | 125000 | 125000 | 125000 | 2023 | 2023 |
| 1570 | Kyle Fuller | 11998 | C | SEA | 10 | 100.0 | 0 | 29 | 46.5 | 31.3 | ... | 1644 | 30 | 1 | 2500000 | 2500000 | 1380000 | 2500000 | 2500000 | 2022 | 2022 |
| 1618 | Trey Hill | 81653 | C | CIN | 7 | 100.0 | 1 | 7 | 53.3 | 65.3 | ... | 1180 | 21 | 4 | 3672352 | 918088 | 192352 | 192352 | 192352 | 2021 | 2024 |
| 1656 | Trystan Colon-Castillo | 41117 | C | BLT | 12 | 98.0 | 0 | 3 | 62.6 | 74.5 | ... | 1786 | 24 | 1 | 1660000 | 1660000 | 250000 | 250000 | 250000 | 2023 | 2023 |
| 1657 | Trystan Colon-Castillo | 41117 | C | BLT | 2 | 100.0 | 0 | 3 | 67.6 | 70.0 | ... | 1786 | 24 | 1 | 1660000 | 1660000 | 250000 | 250000 | 250000 | 2023 | 2023 |
127 rows × 42 columns
# guard data
g_df = offense_blocking_df[offense_blocking_df['position'] == 'G']
g_df
| player | player_id | position | team_name | player_game_count | block_percent | declined_penalties | franchise_id | grades_offense | grades_pass_block | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | Isaac Seumalo | 10713 | G | PHI | 20 | 100.0 | 3 | 24 | 72.7 | 78.2 | ... | 257 | 29 | 3 | 24000000 | 8000000 | 6950000 | 6950000 | 6950000 | 2023 | 2025 |
| 7 | Isaac Seumalo | 10713 | G | PHI | 3 | 100.0 | 0 | 24 | 74.8 | 82.9 | ... | 257 | 29 | 3 | 24000000 | 8000000 | 6950000 | 6950000 | 6950000 | 2023 | 2025 |
| 8 | Isaac Seumalo | 10713 | G | PHI | 9 | 100.0 | 0 | 24 | 62.4 | 77.4 | ... | 257 | 29 | 3 | 24000000 | 8000000 | 6950000 | 6950000 | 6950000 | 2023 | 2025 |
| 9 | Isaac Seumalo | 10713 | G | PHI | 17 | 100.0 | 0 | 24 | 70.6 | 70.1 | ... | 257 | 29 | 3 | 24000000 | 8000000 | 6950000 | 6950000 | 6950000 | 2023 | 2025 |
| 10 | Isaac Seumalo | 10713 | G | PHI | 15 | 99.8 | 1 | 24 | 67.3 | 64.4 | ... | 257 | 29 | 3 | 24000000 | 8000000 | 6950000 | 6950000 | 6950000 | 2023 | 2025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1686 | Shane Lemieux | 28149 | G | NYG | 11 | 100.0 | 2 | 21 | 32.2 | 16.9 | ... | 1194 | 23 | 4 | 3642140 | 910535 | 347140 | 347140 | 347140 | 2020 | 2023 |
| 1689 | Colton McKivitz | 29314 | G | SF | 9 | 100.0 | 0 | 28 | 53.1 | 49.9 | ... | 822 | 26 | 2 | 4610000 | 2305000 | 730000 | 1870000 | 1870000 | 2023 | 2024 |
| 1690 | Chris Hubbard | 8051 | G | CLV | 8 | 100.0 | 0 | 8 | 71.9 | 78.3 | ... | 1346 | 30 | 1 | 2622500 | 2622500 | 152500 | 2622500 | 2622500 | 2022 | 2022 |
| 1699 | Aaron Stinnie | 28653 | G | TB | 6 | 100.0 | 0 | 30 | 57.6 | 41.9 | ... | 1821 | 28 | 1 | 1400000 | 1400000 | 0 | 150000 | 150000 | 2023 | 2023 |
| 1721 | Nick Harris | 46752 | G | CLV | 2 | 100.0 | 0 | 8 | 51.3 | 34.0 | ... | 1207 | 21 | 4 | 3609648 | 902412 | 314648 | 314648 | 314648 | 2020 | 2023 |
336 rows × 42 columns
# tackle data
t_df = offense_blocking_df[offense_blocking_df['position'] == 'T']
t_df
| player | player_id | position | team_name | player_game_count | block_percent | declined_penalties | franchise_id | grades_offense | grades_pass_block | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | Orlando Brown Jr. | 46227 | T | KC | 20 | 100.0 | 1 | 16 | 75.4 | 76.8 | ... | 93 | 26 | 4 | 64092000 | 16023000 | 31100000 | 31100000 | 31100000 | 2023 | 2026 |
| 15 | Orlando Brown Jr. | 46227 | T | KC | 19 | 100.0 | 3 | 16 | 75.4 | 74.4 | ... | 93 | 26 | 4 | 64092000 | 16023000 | 31100000 | 31100000 | 31100000 | 2023 | 2026 |
| 16 | Orlando Brown Jr. | 46227 | T | BLT | 18 | 100.0 | 0 | 3 | 77.8 | 77.8 | ... | 93 | 26 | 4 | 64092000 | 16023000 | 31100000 | 31100000 | 31100000 | 2023 | 2026 |
| 17 | Orlando Brown Jr. | 46227 | T | BLT | 17 | 100.0 | 2 | 3 | 73.6 | 83.3 | ... | 93 | 26 | 4 | 64092000 | 16023000 | 31100000 | 31100000 | 31100000 | 2023 | 2026 |
| 18 | Orlando Brown Jr. | 46227 | T | BLT | 16 | 100.0 | 0 | 3 | 68.6 | 75.1 | ... | 93 | 26 | 4 | 64092000 | 16023000 | 31100000 | 31100000 | 31100000 | 2023 | 2026 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1705 | Kendall Lamm | 10017 | T | CLV | 10 | 98.4 | 1 | 8 | 66.7 | 69.5 | ... | 1832 | 30 | 1 | 1317500 | 1317500 | 100000 | 200000 | 200000 | 2023 | 2023 |
| 1706 | Kendall Lamm | 10017 | T | HST | 15 | 100.0 | 0 | 13 | 65.2 | 76.0 | ... | 1832 | 30 | 1 | 1317500 | 1317500 | 100000 | 200000 | 200000 | 2023 | 2023 |
| 1707 | Kendall Lamm | 10017 | T | HST | 6 | 100.0 | 0 | 13 | 58.4 | 50.2 | ... | 1832 | 30 | 1 | 1317500 | 1317500 | 100000 | 200000 | 200000 | 2023 | 2023 |
| 1742 | Ja'Wuan James | 8654 | T | MIA | 15 | 99.9 | 4 | 17 | 72.4 | 67.4 | ... | 831 | 29 | 2 | 4575000 | 2287500 | 500000 | 500000 | 500000 | 2021 | 2022 |
| 1743 | Ja'Wuan James | 8654 | T | MIA | 8 | 100.0 | 0 | 17 | 69.0 | 77.8 | ... | 831 | 29 | 2 | 4575000 | 2287500 | 500000 | 500000 | 500000 | 2021 | 2022 |
373 rows × 42 columns
On the defensive side we have to treat our dataframes differently. Instead of one category containing information for multiple positions, here each position draws on information from multiple categories. For the interior defensive line and edge rushers, we want stats from both the run defense and pass rush dataframes. For linebackers, we want coverage and run defense statistics. To do this we will merge on position and player name, so that we can split our data up accordingly.
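A minimal sketch (with hypothetical rows) of why the merge key matters here: joining on both `player` and `position` keeps a two-position player's rows separate, whereas joining on the name alone would cross-join every combination of his rows.

```python
import pandas as pd

# Hypothetical run-defense grades: one player graded at two positions
# (as happens with real players like Calais Campbell at DI and ED).
run_d = pd.DataFrame({
    'player':   ['X. Hybrid', 'X. Hybrid', 'Y. Backer'],
    'position': ['DI', 'ED', 'LB'],
    'grades_run_defense': [80.0, 75.0, 70.0],
})

# Hypothetical pass-rush grades for the same two-position player.
pass_rush = pd.DataFrame({
    'player':   ['X. Hybrid', 'X. Hybrid'],
    'position': ['DI', 'ED'],
    'grades_pass_rush': [85.0, 78.0],
})

# Merging on both keys pairs DI with DI and ED with ED only.
both = pd.merge(run_d, pass_rush, on=['player', 'position'])
print(len(both))  # 2 rows; merging on 'player' alone would give 4
```

The linebacker row drops out here because he has no pass-rush row, which is the behavior we rely on when splitting defenders into position-specific dataframes.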
# run_defense players and stats: one CSV per season,
# 2022 has no suffix, then ' (1)' = 2021 down to ' (5)' = 2017
run_defense_dfs = []
for i, year in enumerate(range(2022, 2016, -1)):
    suffix = '' if i == 0 else f' ({i})'
    season = pd.read_csv(f'./run_defense/run_defense_summary{suffix}.csv')
    season['year'] = year
    run_defense_dfs.append(season)
# create dataframe
run_defense_df = pd.concat(run_defense_dfs)
# remove outliers
run_defense_df = run_defense_df[run_defense_df['snap_counts_run'] > 100]
# merge players with salary
run_defense_df = pd.merge(run_defense_df, salary_df, on='player')
run_defense_df
| player | player_id | position | team_name | player_game_count | assists | avg_depth_of_tackle | declined_penalties | forced_fumbles | franchise_id | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Calais Campbell | 4364 | DI | BLT | 15 | 11 | 3.5 | 1 | 0 | 3 | ... | 601 | 36 | 1 | 7000000 | 7000000 | 3000000 | 7000000 | 7000000 | 2023 | 2023 |
| 1 | Calais Campbell | 4364 | DI | BLT | 15 | 12 | 2.1 | 1 | 0 | 3 | ... | 601 | 36 | 1 | 7000000 | 7000000 | 3000000 | 7000000 | 7000000 | 2023 | 2023 |
| 2 | Calais Campbell | 4364 | DI | BLT | 14 | 5 | 1.5 | 0 | 0 | 3 | ... | 601 | 36 | 1 | 7000000 | 7000000 | 3000000 | 7000000 | 7000000 | 2023 | 2023 |
| 3 | Calais Campbell | 4364 | DI | JAX | 16 | 10 | 1.5 | 0 | 2 | 15 | ... | 601 | 36 | 1 | 7000000 | 7000000 | 3000000 | 7000000 | 7000000 | 2023 | 2023 |
| 4 | Calais Campbell | 4364 | ED | JAX | 16 | 10 | 0.8 | 0 | 1 | 15 | ... | 601 | 36 | 1 | 7000000 | 7000000 | 3000000 | 7000000 | 7000000 | 2023 | 2023 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1848 | Jason Cabinda | 50714 | LB | OAK | 10 | 3 | 5.6 | 0 | 0 | 23 | ... | 953 | 25 | 2 | 4100000 | 2050000 | 1000000 | 2035000 | 2035000 | 2022 | 2023 |
| 1849 | Leonard Johnson | 7385 | CB | BUF | 16 | 1 | 3.8 | 1 | 0 | 4 | ... | 1335 | 24 | 3 | 2695000 | 898333 | 0 | 0 | 0 | 2023 | 2025 |
| 1850 | Justin Bethel | 7654 | CB | ARZ | 9 | 4 | 5.2 | 0 | 0 | 1 | ... | 1832 | 32 | 1 | 1317500 | 1317500 | 75000 | 75000 | 75000 | 2023 | 2023 |
| 1851 | Cody Davis | 8458 | S | LA | 7 | 4 | 8.6 | 0 | 0 | 26 | ... | 1682 | 33 | 1 | 2200000 | 2200000 | 0 | 500000 | 500000 | 2023 | 2023 |
| 1852 | Miles Killebrew | 10745 | S | DET | 14 | 7 | 5.8 | 0 | 0 | 11 | ... | 986 | 28 | 2 | 4000000 | 2000000 | 965000 | 965000 | 965000 | 2022 | 2023 |
1853 rows × 35 columns
# pass_rush players and stats: load the 2017-2022 summaries and tag each with its year
pass_rush_dfs = []
for i, year in enumerate(range(2022, 2016, -1)):
    suffix = '' if i == 0 else f' ({i})'
    df = pd.read_csv(f'./pass_rush/pass_rush_summary{suffix}.csv')
    df['year'] = year
    pass_rush_dfs.append(df)
# create one dataframe across all years
pass_rush_df = pd.concat(pass_rush_dfs)
# remove low-snap outliers
pass_rush_df = pass_rush_df[pass_rush_df['snap_counts_pass_rush'] > 100]
# merge players with salary
pass_rush_df = pd.merge(pass_rush_df, salary_df, on='player')
pass_rush_df
| player | player_id | position | team_name | player_game_count | batted_passes | declined_penalties | franchise_id | grades_pass_rush_defense | hits | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Calais Campbell | 4364 | DI | BLT | 15 | 2 | 1 | 3 | 71.7 | 12 | ... | 601 | 36 | 1 | 7000000 | 7000000 | 3000000 | 7000000 | 7000000 | 2023 | 2023 |
| 1 | Calais Campbell | 4364 | DI | BLT | 15 | 1 | 1 | 3 | 69.5 | 11 | ... | 601 | 36 | 1 | 7000000 | 7000000 | 3000000 | 7000000 | 7000000 | 2023 | 2023 |
| 2 | Calais Campbell | 4364 | DI | BLT | 14 | 6 | 0 | 3 | 73.0 | 4 | ... | 601 | 36 | 1 | 7000000 | 7000000 | 3000000 | 7000000 | 7000000 | 2023 | 2023 |
| 3 | Calais Campbell | 4364 | DI | JAX | 16 | 1 | 0 | 15 | 81.4 | 19 | ... | 601 | 36 | 1 | 7000000 | 7000000 | 3000000 | 7000000 | 7000000 | 2023 | 2023 |
| 4 | Calais Campbell | 4364 | ED | JAX | 16 | 2 | 0 | 15 | 70.4 | 11 | ... | 601 | 36 | 1 | 7000000 | 7000000 | 3000000 | 7000000 | 7000000 | 2023 | 2023 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 956 | C.J. Mosley | 8652 | LB | BLT | 16 | 1 | 0 | 3 | 57.3 | 1 | ... | 47 | 26 | 5 | 85000000 | 17000000 | 7500000 | 43000000 | 51000000 | 2019 | 2024 |
| 957 | Jarrad Davis | 11776 | LB | DET | 16 | 0 | 0 | 11 | 74.5 | 3 | ... | 1980 | 28 | 1 | 1180000 | 1180000 | 0 | 0 | 0 | 2023 | 2023 |
| 958 | Jayon Brown | 11910 | LB | TEN | 16 | 0 | 0 | 31 | 86.6 | 4 | ... | 1890 | 27 | 1 | 1250000 | 1250000 | 125000 | 625000 | 625000 | 2022 | 2022 |
| 959 | Deadrin Senat | 48815 | DI | ATL | 15 | 0 | 0 | 2 | 68.7 | 2 | ... | 1895 | 28 | 1 | 1232500 | 1232500 | 152500 | 652500 | 652500 | 2023 | 2023 |
| 960 | Christian Kirksey | 8706 | LB | CLV | 16 | 0 | 0 | 8 | 56.7 | 2 | ... | 492 | 29 | 2 | 10000000 | 5000000 | 2000000 | 4000000 | 4000000 | 2022 | 2023 |
961 rows × 45 columns
# defense_coverage players and stats: load the 2017-2022 summaries and tag each with its year
defense_coverage_dfs = []
for i, year in enumerate(range(2022, 2016, -1)):
    suffix = '' if i == 0 else f' ({i})'
    df = pd.read_csv(f'./defense_coverage/defense_coverage_summary{suffix}.csv')
    df['year'] = year
    defense_coverage_dfs.append(df)
# create one dataframe across all years
defense_coverage_df = pd.concat(defense_coverage_dfs)
# remove low-snap outliers
defense_coverage_df = defense_coverage_df[defense_coverage_df['snap_counts_coverage'] > 100]
# merge players with salary
defense_coverage_df = pd.merge(defense_coverage_df, salary_df, on='player')
defense_coverage_df
| player | player_id | position | team_name | player_game_count | assists | avg_depth_of_target | catch_rate | coverage_percent | coverage_snaps_per_reception | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Patrick Peterson | 6157 | CB | MIN | 18 | 2 | 12.1 | 60.0 | 99.3 | 14.6 | ... | 398 | 32 | 2 | 14000000 | 7000000 | 5850000 | 5850000 | 5850000 | 2023 | 2024 |
| 1 | Patrick Peterson | 6157 | CB | MIN | 13 | 3 | 10.0 | 57.8 | 100.0 | 13.9 | ... | 398 | 32 | 2 | 14000000 | 7000000 | 5850000 | 5850000 | 5850000 | 2023 | 2024 |
| 2 | Patrick Peterson | 6157 | CB | ARZ | 16 | 2 | 12.1 | 66.7 | 99.4 | 12.8 | ... | 398 | 32 | 2 | 14000000 | 7000000 | 5850000 | 5850000 | 5850000 | 2023 | 2024 |
| 3 | Patrick Peterson | 6157 | CB | ARZ | 10 | 3 | 10.4 | 67.2 | 99.3 | 10.8 | ... | 398 | 32 | 2 | 14000000 | 7000000 | 5850000 | 5850000 | 5850000 | 2023 | 2024 |
| 4 | Patrick Peterson | 6157 | CB | ARZ | 16 | 2 | 11.7 | 57.7 | 99.5 | 19.2 | ... | 398 | 32 | 2 | 14000000 | 7000000 | 5850000 | 5850000 | 5850000 | 2023 | 2024 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1124 | Tony Brown | 49255 | CB | GB | 10 | 1 | 10.9 | 63.2 | 100.0 | 7.0 | ... | 1895 | 26 | 1 | 1232500 | 1232500 | 152500 | 500000 | 500000 | 2023 | 2023 |
| 1125 | Leonard Johnson | 7385 | CB | BUF | 16 | 2 | 6.3 | 71.1 | 96.6 | 7.2 | ... | 1335 | 24 | 3 | 2695000 | 898333 | 0 | 0 | 0 | 2023 | 2025 |
| 1126 | Cody Davis | 8458 | S | LA | 7 | 0 | 12.4 | 22.2 | 98.3 | 82.0 | ... | 1682 | 33 | 1 | 2200000 | 2200000 | 0 | 500000 | 500000 | 2023 | 2023 |
| 1127 | Bud Dupree | 9455 | ED | PIT | 16 | 0 | 6.9 | 69.2 | 27.1 | 14.4 | ... | 1279 | 30 | 1 | 3000000 | 3000000 | 1250000 | 2490000 | 2490000 | 2023 | 2023 |
| 1128 | Miles Killebrew | 10745 | S | DET | 15 | 2 | 11.8 | 64.0 | 94.0 | 13.3 | ... | 986 | 28 | 2 | 4000000 | 2000000 | 965000 | 965000 | 965000 | 2022 | 2023 |
1129 rows × 51 columns
idl_df = pd.merge(run_defense_df[run_defense_df['position'] == 'DI'], pass_rush_df[pass_rush_df['position'] == 'DI'])
idl_df
| player | player_id | position | team_name | player_game_count | assists | avg_depth_of_tackle | declined_penalties | forced_fumbles | franchise_id | ... | true_pass_set_hurries | true_pass_set_pass_rush_opp | true_pass_set_pass_rush_percent | true_pass_set_pass_rush_win_rate | true_pass_set_pass_rush_wins | true_pass_set_prp | true_pass_set_sacks | true_pass_set_snap_counts_pass_play | true_pass_set_snap_counts_pass_rush | true_pass_set_total_pressures | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Calais Campbell | 4364 | DI | BLT | 15 | 11 | 3.5 | 1 | 0 | 3 | ... | 13 | 183 | 98.5 | 21.3 | 39 | 9.0 | 6 | 196 | 193 | 27 |
| 1 | Calais Campbell | 4364 | DI | BLT | 15 | 12 | 2.1 | 1 | 0 | 3 | ... | 15 | 183 | 100.0 | 15.3 | 28 | 5.7 | 0 | 194 | 194 | 21 |
| 2 | Calais Campbell | 4364 | DI | BLT | 14 | 5 | 1.5 | 0 | 0 | 3 | ... | 12 | 131 | 98.0 | 13.7 | 18 | 6.5 | 2 | 151 | 148 | 15 |
| 3 | Calais Campbell | 4364 | DI | JAX | 16 | 10 | 1.5 | 0 | 2 | 15 | ... | 28 | 218 | 100.0 | 22.9 | 50 | 11.2 | 5 | 238 | 238 | 44 |
| 4 | Tyson Alualu | 5535 | DI | PIT | 17 | 2 | 2.6 | 0 | 0 | 25 | ... | 6 | 72 | 100.0 | 9.7 | 7 | 5.6 | 0 | 75 | 75 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 429 | L.J. Collier | 48822 | DI | SEA | 17 | 5 | 1.4 | 0 | 0 | 29 | ... | 6 | 97 | 99.0 | 14.4 | 14 | 6.7 | 3 | 101 | 100 | 10 |
| 430 | Daniel Ekuale | 48874 | DI | JAX | 9 | 7 | 3.0 | 0 | 0 | 15 | ... | 0 | 29 | 100.0 | 3.4 | 1 | 1.7 | 0 | 32 | 32 | 1 |
| 431 | Sheldon Day | 10737 | DI | SF | 19 | 3 | 1.9 | 0 | 0 | 28 | ... | 2 | 48 | 100.0 | 6.3 | 3 | 4.2 | 1 | 50 | 50 | 3 |
| 432 | Sheldon Day | 10737 | DI | SF | 12 | 2 | 3.5 | 0 | 0 | 28 | ... | 8 | 60 | 100.0 | 18.3 | 11 | 10.0 | 2 | 61 | 61 | 10 |
| 433 | Sheldon Day | 10737 | DI | SF | 12 | 4 | 1.9 | 0 | 0 | 28 | ... | 1 | 59 | 100.0 | 8.5 | 5 | 4.2 | 2 | 68 | 68 | 3 |
434 rows × 60 columns
edge_df = pd.merge(run_defense_df[run_defense_df['position'] == 'ED'], pass_rush_df[pass_rush_df['position'] == 'ED'])
edge_df
| player | player_id | position | team_name | player_game_count | assists | avg_depth_of_tackle | declined_penalties | forced_fumbles | franchise_id | ... | true_pass_set_hurries | true_pass_set_pass_rush_opp | true_pass_set_pass_rush_percent | true_pass_set_pass_rush_win_rate | true_pass_set_pass_rush_wins | true_pass_set_prp | true_pass_set_sacks | true_pass_set_snap_counts_pass_play | true_pass_set_snap_counts_pass_rush | true_pass_set_total_pressures | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Calais Campbell | 4364 | ED | JAX | 16 | 10 | 0.8 | 0 | 1 | 15 | ... | 18 | 204 | 100.0 | 17.6 | 36 | 10.8 | 11 | 223 | 223 | 33 |
| 1 | Calais Campbell | 4364 | ED | JAX | 19 | 13 | 1.9 | 0 | 1 | 15 | ... | 35 | 321 | 100.0 | 24.9 | 80 | 12.3 | 15 | 344 | 344 | 64 |
| 2 | Brandon Graham | 5538 | ED | PHI | 20 | 2 | 1.1 | 0 | 0 | 24 | ... | 21 | 144 | 96.8 | 31.3 | 45 | 16.0 | 12 | 158 | 153 | 34 |
| 3 | Brandon Graham | 5538 | ED | PHI | 16 | 6 | 1.7 | 0 | 0 | 24 | ... | 20 | 140 | 96.8 | 27.1 | 38 | 12.5 | 5 | 156 | 151 | 30 |
| 4 | Brandon Graham | 5538 | ED | PHI | 17 | 6 | 0.5 | 0 | 1 | 24 | ... | 26 | 235 | 97.3 | 20.4 | 48 | 10.4 | 8 | 258 | 251 | 41 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 350 | Isaac Rochell | 11980 | ED | LAC | 18 | 2 | 2.4 | 0 | 0 | 27 | ... | 16 | 193 | 100.0 | 14.0 | 27 | 7.5 | 6 | 210 | 210 | 23 |
| 351 | Takkarist McKinley | 11781 | ED | ATL | 14 | 6 | 2.9 | 0 | 0 | 2 | ... | 12 | 99 | 91.7 | 20.2 | 20 | 10.6 | 2 | 120 | 110 | 19 |
| 352 | Takkarist McKinley | 11781 | ED | ATL | 15 | 3 | 3.2 | 0 | 0 | 2 | ... | 24 | 193 | 99.1 | 20.7 | 40 | 11.1 | 6 | 217 | 215 | 37 |
| 353 | Takkarist McKinley | 11781 | ED | ATL | 18 | 4 | 0.4 | 1 | 0 | 2 | ... | 23 | 148 | 99.4 | 23.0 | 34 | 13.5 | 7 | 159 | 158 | 33 |
| 354 | Jordan Willis | 11828 | ED | CIN | 16 | 3 | 1.8 | 0 | 0 | 7 | ... | 10 | 138 | 100.0 | 13.8 | 19 | 5.8 | 2 | 148 | 148 | 14 |
355 rows × 60 columns
lb_df = pd.merge(run_defense_df[run_defense_df['position'] == 'LB'], defense_coverage_df[defense_coverage_df['position'] == 'LB'])
lb_df
| player | player_id | position | team_name | player_game_count | assists | avg_depth_of_tackle | declined_penalties | forced_fumbles | franchise_id | ... | qb_rating_against | receptions | snap_counts_coverage | snap_counts_pass_play | targets | touchdowns | yards | yards_after_catch | yards_per_coverage_snap | yards_per_reception |
|---|
0 rows × 56 columns
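The linebacker merge comes back empty, and one likely explanation is that with no `on=` argument, `pd.merge` joins on every column the two frames share, including stat columns (such as `assists`) whose values differ between the run-defense and coverage exports. A minimal sketch of the behavior and a possible fix, using tiny hypothetical frames (the column names here are illustrative, not the real exports):

```python
import pandas as pd

# Two tiny stand-ins for the LB slices: same players, but the shared
# stat column holds different values in each export.
run_def = pd.DataFrame({'player': ['A', 'B'], 'year': [2022, 2022],
                        'grade': [80.1, 65.3]})
coverage = pd.DataFrame({'player': ['A', 'B'], 'year': [2022, 2022],
                         'grade': [71.0, 60.2]})

# Default merge keys on ALL shared columns ('player', 'year', 'grade'),
# so the differing grades leave zero matching rows.
broken = pd.merge(run_def, coverage)

# Joining only on identifying columns keeps the rows and suffixes the
# clashing stat columns instead.
fixed = pd.merge(run_def, coverage, on=['player', 'year'],
                 suffixes=('_run', '_cov'))
```

Passing explicit keys (and `suffixes`) to the real linebacker merge would presumably retain its rows in the same way.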
For cornerbacks and safeties, we only need to worry about coverage, so we will split it the same way we did for offensive players.
# cornerbacks
cb_df = defense_coverage_df[defense_coverage_df['position'] == 'CB']
cb_df
| player | player_id | position | team_name | player_game_count | assists | avg_depth_of_target | catch_rate | coverage_percent | coverage_snaps_per_reception | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Patrick Peterson | 6157 | CB | MIN | 18 | 2 | 12.1 | 60.0 | 99.3 | 14.6 | ... | 398 | 32 | 2 | 14000000 | 7000000 | 5850000 | 5850000 | 5850000 | 2023 | 2024 |
| 1 | Patrick Peterson | 6157 | CB | MIN | 13 | 3 | 10.0 | 57.8 | 100.0 | 13.9 | ... | 398 | 32 | 2 | 14000000 | 7000000 | 5850000 | 5850000 | 5850000 | 2023 | 2024 |
| 2 | Patrick Peterson | 6157 | CB | ARZ | 16 | 2 | 12.1 | 66.7 | 99.4 | 12.8 | ... | 398 | 32 | 2 | 14000000 | 7000000 | 5850000 | 5850000 | 5850000 | 2023 | 2024 |
| 3 | Patrick Peterson | 6157 | CB | ARZ | 10 | 3 | 10.4 | 67.2 | 99.3 | 10.8 | ... | 398 | 32 | 2 | 14000000 | 7000000 | 5850000 | 5850000 | 5850000 | 2023 | 2024 |
| 4 | Patrick Peterson | 6157 | CB | ARZ | 16 | 2 | 11.7 | 57.7 | 99.5 | 19.2 | ... | 398 | 32 | 2 | 14000000 | 7000000 | 5850000 | 5850000 | 5850000 | 2023 | 2024 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1106 | Michael Ojemudia | 50327 | CB | DEN | 14 | 2 | 13.7 | 68.5 | 99.5 | 10.5 | ... | 812 | 22 | 4 | 4751392 | 1187848 | 1015560 | 1015560 | 1015560 | 2020 | 2023 |
| 1109 | Lamar Jackson | 50532 | CB | NYJ | 8 | 2 | 12.1 | 66.7 | 100.0 | 9.8 | ... | 2 | 26 | 5 | 260000000 | 52000000 | 72500000 | 135000000 | 185000000 | 2023 | 2027 |
| 1116 | Siran Neal | 27130 | CB | BUF | 7 | 1 | 5.6 | 72.2 | 91.8 | 9.0 | ... | 519 | 27 | 3 | 9000000 | 3000000 | 1600000 | 2900000 | 3300000 | 2022 | 2024 |
| 1124 | Tony Brown | 49255 | CB | GB | 10 | 1 | 10.9 | 63.2 | 100.0 | 7.0 | ... | 1895 | 26 | 1 | 1232500 | 1232500 | 152500 | 500000 | 500000 | 2023 | 2023 |
| 1125 | Leonard Johnson | 7385 | CB | BUF | 16 | 2 | 6.3 | 71.1 | 96.6 | 7.2 | ... | 1335 | 24 | 3 | 2695000 | 898333 | 0 | 0 | 0 | 2023 | 2025 |
441 rows × 51 columns
# Safeties
s_df = defense_coverage_df[defense_coverage_df['position'] == 'S']
s_df
| player | player_id | position | team_name | player_game_count | assists | avg_depth_of_target | catch_rate | coverage_percent | coverage_snaps_per_reception | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | Harrison Smith | 7641 | S | MIN | 15 | 5 | 10.0 | 80.0 | 97.8 | 16.4 | ... | 96 | 32 | 4 | 64000000 | 16000000 | 9579410 | 14179410 | 26379410 | 2021 | 2025 |
| 31 | Harrison Smith | 7641 | S | MIN | 15 | 9 | 10.1 | 74.1 | 93.5 | 28.9 | ... | 96 | 32 | 4 | 64000000 | 16000000 | 9579410 | 14179410 | 26379410 | 2021 | 2025 |
| 32 | Harrison Smith | 7641 | S | MIN | 16 | 10 | 15.6 | 65.4 | 93.9 | 31.3 | ... | 96 | 32 | 4 | 64000000 | 16000000 | 9579410 | 14179410 | 26379410 | 2021 | 2025 |
| 33 | Harrison Smith | 7641 | S | MIN | 17 | 10 | 9.3 | 66.7 | 94.4 | 25.3 | ... | 96 | 32 | 4 | 64000000 | 16000000 | 9579410 | 14179410 | 26379410 | 2021 | 2025 |
| 34 | Harrison Smith | 7641 | S | MIN | 16 | 1 | 12.4 | 66.7 | 92.7 | 22.2 | ... | 96 | 32 | 4 | 64000000 | 16000000 | 9579410 | 14179410 | 26379410 | 2021 | 2025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1111 | Michael Thomas | 7279 | S | NYG | 16 | 4 | 6.8 | 82.6 | 90.6 | 9.7 | ... | 1983 | 32 | 1 | 1165000 | 1165000 | 0 | 0 | 0 | 2023 | 2023 |
| 1112 | Michael Thomas | 7279 | S | NYG | 16 | 3 | 6.8 | 63.6 | 92.6 | 14.1 | ... | 492 | 30 | 1 | 10000000 | 10000000 | 5000000 | 10000000 | 10000000 | 2023 | 2024 |
| 1113 | Michael Thomas | 7279 | S | NYG | 16 | 3 | 6.8 | 63.6 | 92.6 | 14.1 | ... | 1983 | 32 | 1 | 1165000 | 1165000 | 0 | 0 | 0 | 2023 | 2023 |
| 1126 | Cody Davis | 8458 | S | LA | 7 | 0 | 12.4 | 22.2 | 98.3 | 82.0 | ... | 1682 | 33 | 1 | 2200000 | 2200000 | 0 | 500000 | 500000 | 2023 | 2023 |
| 1128 | Miles Killebrew | 10745 | S | DET | 15 | 2 | 11.8 | 64.0 | 94.0 | 13.3 | ... | 986 | 28 | 2 | 4000000 | 2000000 | 965000 | 965000 | 965000 | 2022 | 2023 |
344 rows × 51 columns
Now we need to explore our data and understand it. This is where we analyze the features of our dataset and start to determine which ones are more valuable for our hypothesis. To do this we will use singular value decomposition (SVD). We will import svds from scipy's sparse linear algebra module and apply it to our data. We first need to remove any irrelevant columns from our datasets, such as player_id, franchise_id and team, and handle any missing data. For missing data we replace it with the mean of that column. Since nearly all missing data has already been dropped, this will not affect the analysis much.
import numpy as np
from scipy.sparse.linalg import svds
# replace sentinel -1 values and any remaining NaNs in numeric columns
# with that column's mean
numeric_cols = passing_df.select_dtypes(include=[np.number]).columns.tolist()
passing_df[numeric_cols] = passing_df[numeric_cols].replace(-1, passing_df[numeric_cols].mean()).fillna(passing_df[numeric_cols].mean())
# Extract the matrix of numerical features to perform SVD on,
# dropping target (salary) and identifier columns
features = passing_df.select_dtypes(include=[np.number])
features = features.drop(['year', 'Signed Age', 'Yrs', 'Value', 'AAV',
                          'Sign Bonus', "G'teed @ Sign", "Practical G'teed",
                          'player_id', 'franchise_id', 'Rank', 'start_year',
                          'end_year', 'attempts', 'passing_snaps'], axis=1)
cols = len(features.columns) - 1
features
# Perform SVD on the matrix
U, S, Vt = svds(features.to_numpy(), k=cols)
# Inspect the dominant right singular vector and the feature matrix shape
Vt[cols-1]
features.shape
# form the (uncentered) scatter/covariance matrix of the features
covariance = np.dot(features.to_numpy().T, features.to_numpy())
covariance.shape
U, S, Vt = svds(covariance, k=cols)
print('Shape of U:', U.shape)
print('Shape of S:', S.shape)
print('Shape of Vt:', Vt.shape)
Shape of U: (35, 34)
Shape of S: (34,)
Shape of Vt: (34, 35)
The singular values tell us how much of the data's variation each component captures, and each element of a singular vector corresponds to one feature's weight in that component. Features whose weights are larger in magnitude contribute more to that component, while values close to 0 mean the feature contributes little. Now we will find and list the features themselves, ordered by their weight in the dominant component, to give us a good picture of the value of our features.
print(S)
print(Vt[cols-2])
print(features.columns)
[5.32132048e+00 1.95195817e+01 2.21593767e+01 2.69058572e+01
3.08601982e+01 3.25226529e+01 4.34286074e+01 1.25780933e+02
1.65291827e+02 2.07045581e+02 3.03147850e+02 3.95140046e+02
4.27939212e+02 5.63524917e+02 7.81887424e+02 1.16018319e+03
1.28718542e+03 1.84625105e+03 2.97174001e+03 3.64043634e+03
4.15132218e+03 6.67026650e+03 7.24450580e+03 9.67151064e+03
1.20079857e+04 1.65821938e+04 2.13225843e+04 2.93136814e+04
4.74373874e+04 5.25951189e+04 1.52070323e+05 5.24450674e+05
1.16167356e+06 2.21986736e+09]
[-2.34379971e-02 -3.21159966e-01 -3.07499708e-01 -3.80265072e-02
-1.31386273e-02 -1.01691087e-02 2.17564836e-02 -1.07714363e-02
-2.71265983e-01 -1.30782715e-01 -1.84448929e-04 -2.35771655e-01
-3.68752391e-02 -4.75904584e-01 -2.26877224e-02 -1.49394939e-02
-2.54390389e-01 -2.41356439e-01 -2.30091387e-01 -2.94629687e-01
-9.20545531e-03 -1.64021726e-02 -6.77729132e-03 -1.03144641e-01
-3.27472015e-01 -3.78441921e-02 -7.04074693e-02 -5.86298697e-02
-4.35628176e-03 -1.32807256e-02 2.70608404e-02 -2.86607717e-02
-1.84800500e-02 1.75919108e-01 -2.68586227e-02]
Index(['player_game_count', 'accuracy_percent', 'aimed_passes',
'avg_depth_of_target', 'avg_time_to_throw', 'bats', 'big_time_throws',
'btt_rate', 'completion_percent', 'completions', 'declined_penalties',
'def_gen_pressures', 'drop_rate', 'dropbacks', 'drops', 'first_downs',
'grades_hands_fumble', 'grades_offense', 'grades_pass', 'grades_run',
'hit_as_threw', 'interceptions', 'penalties', 'pressure_to_sack_rate',
'qb_rating', 'sack_percent', 'sacks', 'scrambles', 'spikes',
'thrown_aways', 'touchdowns', 'turnover_worthy_plays', 'twp_rate',
'yards', 'ypa'],
dtype='object')
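One detail worth keeping in mind when reading these arrays: `scipy.sparse.linalg.svds` returns singular values in ascending order, the opposite of `numpy.linalg.svd`, which is why the dominant component sits at the end of `S` and in the last row of `Vt`. A quick self-contained check on random data (the shapes here are arbitrary):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 6))

# svds returns the k largest singular values, sorted ascending
U, S, Vt = svds(A, k=5)
assert np.all(np.diff(S) >= 0)

# numpy's full SVD sorts descending; the largest values agree
full_S = np.linalg.svd(A, compute_uv=False)
assert np.isclose(S[-1], full_S[0])
```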
We want to plot our eigenvalues to get an idea of which vectors tell us the most about our data.
import matplotlib.pyplot as plt
plt.scatter([i for i in range(len(S))], np.log(S))
plt.title("Eigenvalues")
plt.xlabel("index")
plt.ylabel('log(value)')
Text(0, 0.5, 'log(value)')
Now we want to sort the last vector in the Vt matrix in order to visualize the variability of our features; larger magnitudes indicate stronger variability. We must also sort the three vectors before it with the same ordering. As seen in the graph above, the last four dots grow at a much faster rate than the rest, so we want to visualize those four vectors. To keep the sorting consistent, we store all the values at the same time so we do not lose the order of our features.
sorted_feature_weights1 = sorted(Vt[cols-1], reverse=True)
sorted_feature_weights2 = []
sorted_feature_weights3 = []
sorted_feature_weights4 = []
sorted_features = []
for i, w in enumerate(sorted_feature_weights1):
    fInd = np.where(Vt[cols-1] == w)[0][0]
    print(f'{i+1}: {list(features.columns)[fInd]}, {w}')
    sorted_features.append(list(features.columns)[fInd])
    sorted_feature_weights2.append(Vt[cols-2][fInd])
    sorted_feature_weights3.append(Vt[cols-3][fInd])
    sorted_feature_weights4.append(Vt[cols-4][fInd])
1: declined_penalties, -5.1902476351690364e-05
2: spikes, -0.0005040851831544245
3: avg_time_to_throw, -0.0007336287756775956
4: twp_rate, -0.0008677507881194244
5: penalties, -0.0010294444872858493
6: btt_rate, -0.0012020187441530163
7: hit_as_threw, -0.001488683637183896
8: sack_percent, -0.0016230606162863943
9: drop_rate, -0.0017654610797555451
10: ypa, -0.0019369059513612257
11: bats, -0.0022662809650424204
12: avg_depth_of_target, -0.00227007044048473
13: interceptions, -0.002789288795920271
14: player_game_count, -0.003960862305946379
15: pressure_to_sack_rate, -0.004807589673502431
16: turnover_worthy_plays, -0.005019104882233476
17: thrown_aways, -0.0058182865248409075
18: drops, -0.006068845224694851
19: touchdowns, -0.006455059194627049
20: big_time_throws, -0.006606178892858496
21: scrambles, -0.007147602683265029
22: sacks, -0.008843311923979863
23: grades_hands_fumble, -0.015984998460563206
24: completion_percent, -0.01703716344222998
25: grades_run, -0.017993910644480807
26: grades_pass, -0.01950372002569472
27: accuracy_percent, -0.019768600696350242
28: grades_offense, -0.020069424039326364
29: qb_rating, -0.024586800717168903
30: def_gen_pressures, -0.04979298030384959
31: first_downs, -0.050217903281132084
32: completions, -0.08555131303500313
33: aimed_passes, -0.12160033610316452
34: dropbacks, -0.14797675341032865
35: yards, -0.9736348714320443
Now we will put everything together in one graph: a scatter diagram of each of the four vectors, displaying the values for each feature individually. This will give us a good idea of the variability of the features, so that we can understand which ones will be useful when trying to predict salary and which ones only tell us things we already know, at the risk of overfitting if we train on them.
# scatter plot for all of the vectors
x = [i for i in range(35)]
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(sorted_feature_weights1[::-1], x, color='black')
ax.scatter(sorted_feature_weights2[::-1], x, color='red')
ax.scatter(sorted_feature_weights3[::-1], x, color='blue')
ax.scatter(sorted_feature_weights4[::-1], x, color='green')
ax.set_yticks(range(35))
ax.set_yticklabels(sorted_features[::-1])
plt.xlabel("value")
plt.show()
Now we want to explore some of these features individually. Let's start with yards, since our principal component analysis showed it to be the feature with the most variability. We want to see how it correlates with our targets.
# yards and average salary relationship
plt.cla()
m, b = np.polyfit(passing_df['yards'], passing_df["AAV"],1)
plt.scatter( passing_df['yards'], passing_df["AAV"],)
plt.plot(passing_df['yards'], m*passing_df['yards'] + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('Yards vs AAV')
Text(0.5, 1.0, 'Yards vs AAV')
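To put a number on the visual trend, one could also compute the Pearson correlation coefficient with `np.corrcoef`. A sketch on synthetic yards/AAV pairs (the real frame isn't reproduced here, so the numbers below are made up to mimic a positive linear trend):

```python
import numpy as np

# Hypothetical yards/AAV pairs: a positive linear trend plus noise
rng = np.random.default_rng(42)
yards = rng.uniform(500, 5000, 200)
aav = 5000 * yards + rng.normal(0, 2e6, 200)

# Pearson r: +1 would be a perfect positive linear relationship
r = np.corrcoef(yards, aav)[0, 1]
```

On the real data, `np.corrcoef(passing_df['yards'], passing_df['AAV'])[0, 1]` would give the analogous figure.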
It's fairly hard to tell exactly what is going on, but we do see a positive correlation between yards and salary, with the players earning a higher salary generally throwing for more yards than those earning less. The next notable feature we want to examine is dropbacks.
# AAV and dropbacks relationship
plt.cla()
plt.scatter(passing_df['dropbacks'],passing_df["AAV"])
plt.xlabel('dropbacks')
plt.ylabel('AAV')
plt.title('dropbacks vs AAV')
Text(0.5, 1.0, 'dropbacks vs AAV')
It looks surprisingly similar. Let's look at PFF grade, which again is a grade assigned by analysts at Pro Football Focus who grade every play of the player.
# AAV and pff grade relationship
plt.cla()
m, b = np.polyfit(passing_df['grades_offense'], passing_df["AAV"],1)
plt.scatter(passing_df['grades_offense'], passing_df["AAV"],)
plt.plot(passing_df['grades_offense'], m*passing_df['grades_offense'] + b, color='red')
plt.xlabel('PFF grade')
plt.ylabel('AAV')
plt.title('PFF grade vs AAV')
plt.show()
What is very noticeable across all of these graphs is the two clusters that appear in the plots, separated by salary. The lower cluster represents players on rookie contracts, while the upper cluster contains players who have been paid second contracts, which are significantly larger. When a player is first drafted, they are essentially assigned a contract that they play on for 4-5 years before becoming eligible for a new one. Typically, rookie contracts are fairly cheap, and if the player is good enough they either get extended or sign with another team, usually for a lot more money than they were making before. This is a very important distinction in our data. We will explore it further with a k-means analysis, which will give us a good idea of the split between our clusters.
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
# Select the two features of interest
data = passing_df[['AAV', 'yards']]
# Perform k-means clustering; note we cluster on raw units, so AAV's much
# larger scale dominates the distances and the split falls along salary
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Add cluster labels to the DataFrame
passing_df["Cluster"] = labels
# Plot the scatter plot
plt.scatter(passing_df["yards"], passing_df["AAV"], c=labels, cmap='viridis')
plt.scatter(centroids[:, 1], centroids[:, 0], c='red', marker='x', s=100)
# Set labels and title
plt.ylabel('Average Annual Value (AAV)')
plt.xlabel('Yards')
plt.title('K-means Clustering on AAV')
plt.ticklabel_format(style='plain')
# Display the plot
plt.show()
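Because the clustering above runs on raw units, AAV (tens of millions) dwarfs yards (a few thousand), so the distance metric is dominated by salary, which is consistent with the salary-split clusters we observe. Since `MinMaxScaler` is already imported, a variant that weights both features equally might look like the following sketch (run on hypothetical two-tier data standing in for `passing_df[['AAV', 'yards']]`, not the real frame):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical two salary tiers mimicking rookie vs. veteran contracts
rng = np.random.default_rng(7)
aav = np.concatenate([rng.uniform(1e6, 5e6, 50), rng.uniform(2e7, 5e7, 50)])
yards = rng.uniform(1000, 5000, 100)
data = np.column_stack([aav, yards])

# Scale each feature to [0, 1] so yards can influence the distances too
scaled = MinMaxScaler().fit_transform(data)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
```

Whether to scale is a modeling choice here: the unscaled version deliberately surfaces the salary split, while the scaled one would let playing volume shape the clusters as well.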
We get two very clear clusters, and the data within each is quite different: the yellow cluster is more condensed, while the purple cluster is much more spread out.
from sklearn.linear_model import LinearRegression
data = passing_df[["yards", "AAV", "Cluster"]]
# Split the DataFrame into clusters
cluster_0_df = passing_df[passing_df["Cluster"] == 0]
cluster_1_df = passing_df[passing_df["Cluster"] == 1]
# Perform linear regression within each cluster
linear_regression_0 = LinearRegression()
linear_regression_0.fit(cluster_0_df[["yards"]], cluster_0_df["AAV"])
linear_regression_1 = LinearRegression()
linear_regression_1.fit(cluster_1_df[["yards"]], cluster_1_df["AAV"])
# Print the coefficients for each cluster
print("Cluster 0: Passing Yards Coefficient =", linear_regression_0.coef_)
print("Cluster 1: Passing Yards Coefficient =", linear_regression_1.coef_)
Cluster 0: Passing Yards Coefficient = [598.48838876]
Cluster 1: Passing Yards Coefficient = [101.41286478]
import seaborn as sns
# Create a violin plot
sns.violinplot(x="Cluster", y="yards", data=data)
# Set labels and title
plt.xlabel('Cluster')
plt.ylabel('Passing Yards')
plt.title('Passing Yards Distribution by Cluster')
plt.show()
The above violin plot shows that the players making above 20 million have much higher passing yardage than the players making under 20 million. This could be explained by how often players are allowed to play: big-name players get paid more and get more opportunities to accumulate yards than lower-paid players.
# copy the slice so adding a column doesn't trigger SettingWithCopyWarning
data = passing_df[["yards", "AAV", "Cluster", "player_game_count"]].copy()
# Calculate yards per game
data["Yards Per Game"] = data["yards"] / data["player_game_count"]
# Create a violin plot
sns.violinplot(x="Cluster", y="Yards Per Game", data=data)
# Set labels and title
plt.xlabel('Cluster')
plt.ylabel('Yards Per Game')
plt.title('Yards Per Game Distribution by Cluster')
plt.show()
This gives us a better picture of how our clusters are distributed. In our left cluster, more datapoints are concentrated in one area, at around 250 yards per game. Our other cluster is a tad more spread out, with the bulk of datapoints lying around 200 yards per game. We can clearly see there is a correlation between getting paid more and throwing for more yards. But can we correlate well within our clusters to find more insightful discoveries? Let's explore further. First, let's average out our quarterbacks' stats rather than keep them separate by year, in order to get a more general picture.
# create one row per player with their average for each feature
grouped_qbs = passing_df.groupby('player')[numeric_cols].mean()
grouped_qbs
| player_id | player_game_count | accuracy_percent | aimed_passes | attempts | avg_depth_of_target | avg_time_to_throw | bats | big_time_throws | btt_rate | ... | Rank | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| player | |||||||||||||||||||||
| Aaron Rodgers | 2241.0 | 15.500000 | 76.533333 | 485.500000 | 530.500000 | 8.416667 | 2.763333 | 3.833333 | 34.166667 | 5.850000 | ... | 11.0 | 38.0 | 3.0 | 150815000.0 | 50271667.0 | 40800000.0 | 101515000.0 | 150815000.0 | 2022.0 | 2024.0 |
| Andy Dalton | 6187.0 | 12.166667 | 73.333333 | 359.166667 | 389.500000 | 8.300000 | 2.525000 | 7.833333 | 16.833333 | 3.950000 | ... | 492.0 | 35.0 | 2.0 | 10000000.0 | 5000000.0 | 4835000.0 | 8000000.0 | 8000000.0 | 2023.0 | 2024.0 |
| Baker Mayfield | 46518.0 | 14.800000 | 72.260000 | 427.800000 | 466.000000 | 8.780000 | 2.806000 | 11.800000 | 26.200000 | 5.040000 | ... | 986.0 | 27.0 | 1.0 | 4000000.0 | 4000000.0 | 2875000.0 | 4000000.0 | 4000000.0 | 2023.0 | 2023.0 |
| Blaine Gabbert | 6162.0 | 5.500000 | 66.050000 | 129.500000 | 136.000000 | 8.750000 | 2.435000 | 3.500000 | 5.000000 | 3.300000 | ... | 1832.0 | 33.0 | 1.0 | 1317500.0 | 1317500.0 | 152500.0 | 1092500.0 | 1092500.0 | 2023.0 | 2023.0 |
| Brian Hoyer | 5277.0 | 8.000000 | 68.800000 | 202.000000 | 211.000000 | 8.400000 | 2.390000 | 3.000000 | 9.000000 | 4.100000 | ... | 836.0 | 37.0 | 2.0 | 4500000.0 | 2250000.0 | 875000.0 | 4205000.0 | 4205000.0 | 2023.0 | 2024.0 |
| Brock Purdy | 77672.0 | 10.000000 | 77.400000 | 212.000000 | 233.000000 | 7.800000 | 2.840000 | 7.000000 | 5.000000 | 2.000000 | ... | 1164.0 | 22.0 | 4.0 | 3737008.0 | 934252.0 | 77008.0 | 77008.0 | 77008.0 | 2022.0 | 2025.0 |
| C.J. Beathard | 11859.0 | 6.000000 | 73.566667 | 150.333333 | 165.666667 | 7.933333 | 2.550000 | 5.333333 | 5.000000 | 2.933333 | ... | 836.0 | 29.0 | 2.0 | 4500000.0 | 2250000.0 | 1000000.0 | 2000000.0 | 2000000.0 | 2023.0 | 2024.0 |
| Case Keenum | 7323.0 | 14.000000 | 74.800000 | 430.333333 | 467.333333 | 7.766667 | 2.626667 | 6.000000 | 18.666667 | 3.766667 | ... | 634.0 | 35.0 | 2.0 | 6250000.0 | 3125000.0 | 750000.0 | 4000000.0 | 4000000.0 | 2023.0 | 2024.0 |
| Colt McCoy | 5609.0 | 4.000000 | 75.800000 | 128.000000 | 132.000000 | 6.400000 | 2.500000 | 1.000000 | 5.000000 | 3.500000 | ... | 584.0 | 35.0 | 2.0 | 7500000.0 | 3750000.0 | 2500000.0 | 6000000.0 | 6000000.0 | 2022.0 | 2023.0 |
| Cooper Rush | 12324.0 | 8.000000 | 69.400000 | 144.000000 | 162.000000 | 8.900000 | 2.430000 | 7.000000 | 3.000000 | 1.800000 | ... | 753.0 | 29.0 | 2.0 | 5000000.0 | 2500000.0 | 1250000.0 | 2750000.0 | 2750000.0 | 2023.0 | 2024.0 |
| Dak Prescott | 10769.0 | 14.333333 | 75.900000 | 471.666667 | 500.333333 | 8.616667 | 2.743333 | 6.500000 | 22.333333 | 4.250000 | ... | 8.0 | 27.0 | 4.0 | 160000000.0 | 40000000.0 | 66000000.0 | 95000000.0 | 126000000.0 | 2021.0 | 2024.0 |
| Daniel Jones | 39395.0 | 14.000000 | 75.425000 | 413.500000 | 450.500000 | 7.550000 | 2.812500 | 8.000000 | 14.750000 | 3.075000 | ... | 8.0 | 25.0 | 4.0 | 160000000.0 | 40000000.0 | 36000000.0 | 81000000.0 | 92000000.0 | 2023.0 | 2026.0 |
| David Blough | 46508.0 | 5.000000 | 68.600000 | 159.000000 | 174.000000 | 8.900000 | 2.890000 | 6.000000 | 7.000000 | 3.700000 | ... | 1895.0 | 27.0 | 1.0 | 1232500.0 | 1232500.0 | 76250.0 | 76250.0 | 76250.0 | 2023.0 | 2023.0 |
| Davis Mills | 52269.0 | 14.000000 | 72.250000 | 402.500000 | 436.500000 | 8.100000 | 2.615000 | 10.000000 | 18.500000 | 3.950000 | ... | 714.0 | 22.0 | 4.0 | 5217531.0 | 1304383.0 | 1157892.0 | 1157892.0 | 1157892.0 | 2021.0 | 2024.0 |
| Derek Carr | 8671.0 | 16.000000 | 76.716667 | 506.000000 | 546.666667 | 8.166667 | 2.656667 | 8.000000 | 26.500000 | 4.533333 | ... | 12.0 | 31.0 | 4.0 | 150000000.0 | 37500000.0 | 28500000.0 | 60000000.0 | 100000000.0 | 2023.0 | 2026.0 |
| Deshaun Watson | 11767.0 | 12.600000 | 73.740000 | 383.000000 | 408.800000 | 9.360000 | 3.094000 | 4.400000 | 23.800000 | 5.220000 | ... | 7.0 | 26.0 | 5.0 | 230000000.0 | 46000000.0 | 44965000.0 | 230000000.0 | 230000000.0 | 2022.0 | 2026.0 |
| Desmond Ridder | 55086.0 | 4.000000 | 68.500000 | 108.000000 | 115.000000 | 8.400000 | 3.000000 | 1.000000 | 3.000000 | 2.500000 | ... | 697.0 | 22.0 | 4.0 | 5362959.0 | 1340740.0 | 1080336.0 | 1080336.0 | 1080336.0 | 2022.0 | 2025.0 |
| Drew Lock | 39517.0 | 8.000000 | 70.300000 | 219.666667 | 236.666667 | 9.066667 | 2.820000 | 3.666667 | 13.333333 | 4.500000 | ... | 986.0 | 26.0 | 1.0 | 4000000.0 | 4000000.0 | 1750000.0 | 1750000.0 | 1750000.0 | 2023.0 | 2023.0 |
| Gardner Minshew | 41401.0 | 11.500000 | 73.200000 | 366.000000 | 398.500000 | 8.200000 | 2.840000 | 5.000000 | 17.000000 | 4.150000 | ... | 1233.0 | 26.0 | 1.0 | 3500000.0 | 3500000.0 | 0.0 | 3500000.0 | 3500000.0 | 2023.0 | 2023.0 |
| Geno Smith | 7820.0 | 18.000000 | 77.700000 | 565.000000 | 607.000000 | 8.300000 | 2.790000 | 8.000000 | 35.000000 | 5.400000 | ... | 63.0 | 32.0 | 3.0 | 75000000.0 | 25000000.0 | 0.0 | 27300000.0 | 40000000.0 | 2023.0 | 2025.0 |
| Jacoby Brissett | 10725.0 | 13.000000 | 74.100000 | 343.500000 | 377.250000 | 8.400000 | 2.907500 | 4.750000 | 13.250000 | 3.375000 | ... | 561.0 | 30.0 | 1.0 | 8000000.0 | 8000000.0 | 4500000.0 | 7500000.0 | 7500000.0 | 2023.0 | 2023.0 |
| Jalen Hurts | 40291.0 | 14.666667 | 71.633333 | 351.333333 | 390.000000 | 9.433333 | 3.136667 | 9.333333 | 19.333333 | 4.633333 | ... | 4.0 | 24.0 | 5.0 | 255000000.0 | 51000000.0 | 23294000.0 | 110000000.0 | 179399000.0 | 2023.0 | 2028.0 |
| Jameis Winston | 9434.0 | 10.000000 | 71.100000 | 321.400000 | 344.400000 | 10.940000 | 2.818000 | 4.600000 | 17.200000 | 5.120000 | ... | 986.0 | 29.0 | 1.0 | 4000000.0 | 4000000.0 | 2835000.0 | 4000000.0 | 4000000.0 | 2023.0 | 2023.0 |
| Jared Goff | 10635.0 | 16.500000 | 75.483333 | 528.833333 | 582.333333 | 7.850000 | 2.720000 | 11.333333 | 23.666667 | 3.783333 | ... | 17.0 | 24.0 | 4.0 | 134000000.0 | 33500000.0 | 25000000.0 | 57042682.0 | 110042682.0 | 2019.0 | 2024.0 |
| Jimmy Garoppolo | 8697.0 | 12.000000 | 75.900000 | 316.200000 | 334.800000 | 7.540000 | 2.566000 | 6.800000 | 9.600000 | 2.760000 | ... | 69.0 | 31.0 | 3.0 | 72750000.0 | 24250000.0 | 11250000.0 | 33750000.0 | 45000000.0 | 2023.0 | 2025.0 |
| Joe Burrow | 28022.0 | 16.333333 | 76.866667 | 553.333333 | 593.666667 | 8.233333 | 2.573333 | 12.666667 | 30.333333 | 4.700000 | ... | 185.0 | 23.0 | 4.0 | 36190137.0 | 9047534.0 | 23880100.0 | 36190137.0 | 36190137.0 | 2020.0 | 2024.0 |
| Joe Flacco | 4332.0 | 8.600000 | 73.400000 | 272.600000 | 303.000000 | 8.620000 | 2.614000 | 7.200000 | 10.200000 | 3.400000 | ... | 1233.0 | 36.0 | 1.0 | 3500000.0 | 3500000.0 | 1730000.0 | 2850000.0 | 2850000.0 | 2022.0 | 2022.0 |
| Josh Allen | 46601.0 | 17.000000 | 72.600000 | 528.800000 | 575.000000 | 9.960000 | 2.992000 | 10.400000 | 34.600000 | 5.460000 | ... | 135.5 | 23.0 | 5.0 | 140388319.0 | 24345663.5 | 15530050.0 | 61390617.0 | 86371319.0 | 2020.0 | 2025.5 |
| Justin Fields | 82118.0 | 13.500000 | 69.300000 | 271.500000 | 294.000000 | 9.800000 | 3.255000 | 6.500000 | 15.500000 | 4.950000 | ... | 327.0 | 22.0 | 4.0 | 18871957.0 | 4717989.0 | 11085060.0 | 18871957.0 | 18871957.0 | 2021.0 | 2025.0 |
| Justin Herbert | 28237.0 | 16.666667 | 77.233333 | 616.666667 | 670.000000 | 7.533333 | 2.690000 | 17.333333 | 27.333333 | 3.933333 | ... | 243.0 | 22.0 | 4.0 | 26578755.0 | 6644689.0 | 16890004.0 | 26578755.0 | 26578755.0 | 2020.0 | 2024.0 |
| Kenny Pickett | 60515.0 | 13.000000 | 73.300000 | 360.000000 | 389.000000 | 8.000000 | 2.990000 | 8.000000 | 18.000000 | 4.300000 | ... | 393.0 | 23.0 | 4.0 | 14067905.0 | 3516976.0 | 7411204.0 | 14067905.0 | 14067905.0 | 2022.0 | 2026.0 |
| Kirk Cousins | 7102.0 | 16.500000 | 77.850000 | 520.833333 | 568.166667 | 8.250000 | 2.695000 | 11.333333 | 26.333333 | 4.366667 | ... | 191.0 | 33.0 | 1.0 | 35000000.0 | 35000000.0 | 25000000.0 | 35000000.0 | 35000000.0 | 2022.0 | 2023.0 |
| Kyle Allen | 26152.0 | 13.000000 | 73.300000 | 450.000000 | 489.000000 | 8.600000 | 2.690000 | 7.000000 | 20.000000 | 3.900000 | ... | 1895.0 | 27.0 | 1.0 | 1232500.0 | 1232500.0 | 100000.0 | 350000.0 | 350000.0 | 2023.0 | 2023.0 |
| Kyler Murray | 38334.0 | 14.500000 | 76.025000 | 458.500000 | 501.000000 | 7.825000 | 2.737500 | 10.250000 | 25.500000 | 4.675000 | ... | 6.0 | 24.0 | 5.0 | 230500000.0 | 46100000.0 | 29035000.0 | 103300000.0 | 159797000.0 | 2022.0 | 2028.0 |
| Lamar Jackson | 46416.0 | 14.000000 | 72.860000 | 331.400000 | 358.200000 | 9.380000 | 3.110000 | 7.800000 | 18.400000 | 4.720000 | ... | 2.0 | 26.0 | 5.0 | 260000000.0 | 52000000.0 | 72500000.0 | 135000000.0 | 185000000.0 | 2023.0 | 2027.0 |
| Mac Jones | 60323.0 | 16.000000 | 74.650000 | 468.500000 | 500.500000 | 8.250000 | 2.655000 | 8.500000 | 24.000000 | 4.550000 | ... | 365.0 | 22.0 | 4.0 | 15586352.0 | 3896588.0 | 8695528.0 | 15586352.0 | 15586352.0 | 2021.0 | 2025.0 |
| Marcus Mariota | 9435.0 | 13.000000 | 72.175000 | 309.000000 | 328.250000 | 8.875000 | 2.875000 | 3.500000 | 14.250000 | 3.675000 | ... | 753.0 | 29.0 | 1.0 | 5000000.0 | 5000000.0 | 3835000.0 | 5000000.0 | 5000000.0 | 2023.0 | 2023.0 |
| Mason Rudolph | 46569.0 | 10.000000 | 74.500000 | 259.000000 | 283.000000 | 8.100000 | 2.740000 | 3.000000 | 6.000000 | 2.000000 | ... | 737.0 | 25.0 | 1.0 | 5080000.0 | 5080000.0 | 2080000.0 | 2080000.0 | 2080000.0 | 2021.0 | 2022.0 |
| Matthew Stafford | 4924.0 | 14.333333 | 74.033333 | 471.166667 | 497.166667 | 8.766667 | 2.598333 | 7.666667 | 30.000000 | 5.800000 | ... | 8.0 | 34.0 | 4.0 | 160000000.0 | 40000000.0 | 60000000.0 | 63000000.0 | 130000000.0 | 2022.0 | 2026.0 |
| Mike White | 46453.0 | 4.000000 | 73.000000 | 140.500000 | 153.500000 | 7.200000 | 2.565000 | 5.000000 | 5.500000 | 3.300000 | ... | 561.0 | 27.0 | 2.0 | 8000000.0 | 4000000.0 | 3420000.0 | 4500000.0 | 4500000.0 | 2023.0 | 2024.0 |
| Mitchell Trubisky | 11757.0 | 11.800000 | 71.300000 | 348.200000 | 365.800000 | 8.840000 | 2.714000 | 3.600000 | 16.400000 | 4.380000 | ... | 388.0 | 27.0 | 2.0 | 14285000.0 | 7142500.0 | 5250000.0 | 5250000.0 | 5250000.0 | 2022.0 | 2023.0 |
| Nick Mullens | 12282.0 | 9.000000 | 76.900000 | 269.000000 | 300.000000 | 7.200000 | 2.515000 | 9.500000 | 7.500000 | 2.300000 | ... | 986.0 | 27.0 | 2.0 | 4000000.0 | 2000000.0 | 750000.0 | 1922500.0 | 1922500.0 | 2023.0 | 2024.0 |
| Patrick Mahomes | 11765.0 | 18.600000 | 77.860000 | 638.600000 | 696.000000 | 8.300000 | 2.854000 | 8.200000 | 41.000000 | 5.560000 | ... | 1.0 | 24.0 | 10.0 | 450000000.0 | 45000000.0 | 10000000.0 | 63081905.0 | 141481905.0 | 2020.0 | 2031.0 |
| Russell Wilson | 7077.0 | 16.166667 | 75.250000 | 466.500000 | 508.833333 | 9.850000 | 2.966667 | 7.833333 | 37.000000 | 6.750000 | ... | 5.0 | 33.0 | 5.0 | 242588236.0 | 48517647.0 | 50000000.0 | 124000000.0 | 161000000.0 | 2022.0 | 2028.0 |
| Ryan Tannehill | 7014.0 | 14.400000 | 75.800000 | 372.000000 | 401.400000 | 8.720000 | 2.688000 | 8.800000 | 16.800000 | 3.920000 | ... | 21.0 | 31.0 | 4.0 | 118000000.0 | 29500000.0 | 20000000.0 | 62000000.0 | 91000000.0 | 2020.0 | 2023.0 |
| Sam Darnold | 29048.0 | 11.200000 | 72.480000 | 315.400000 | 353.000000 | 8.700000 | 2.896000 | 5.600000 | 13.800000 | 3.940000 | ... | 836.0 | 25.0 | 1.0 | 4500000.0 | 4500000.0 | 1500000.0 | 3500000.0 | 3500000.0 | 2023.0 | 2023.0 |
| Sam Ehlinger | 60555.0 | 4.000000 | 71.100000 | 97.000000 | 101.000000 | 7.800000 | 3.020000 | 2.000000 | 3.000000 | 2.900000 | ... | 1201.0 | 22.0 | 4.0 | 3610708.0 | 902677.0 | 130708.0 | 130708.0 | 130708.0 | 2021.0 | 2024.0 |
| Skylar Thompson | 60417.0 | 8.000000 | 64.400000 | 135.000000 | 150.000000 | 9.000000 | 3.040000 | 1.000000 | 4.000000 | 2.500000 | ... | 1159.0 | 24.0 | 4.0 | 3742216.0 | 935554.0 | 82216.0 | 82216.0 | 82216.0 | 2022.0 | 2025.0 |
| Taylor Heinicke | 9896.0 | 12.500000 | 72.150000 | 347.500000 | 376.500000 | 8.700000 | 2.945000 | 7.500000 | 13.500000 | 3.400000 | ... | 398.0 | 29.0 | 2.0 | 14000000.0 | 7000000.0 | 4000000.0 | 6320000.0 | 6320000.0 | 2023.0 | 2024.0 |
| Taysom Hill | 12112.0 | 12.000000 | 77.900000 | 119.000000 | 127.500000 | 7.700000 | 2.970000 | 1.000000 | 2.500000 | 1.900000 | ... | 163.0 | 31.0 | 4.0 | 40000000.0 | 10000000.0 | 0.0 | 10100000.0 | 21500000.0 | 2021.0 | 2025.0 |
| Teddy Bridgewater | 8667.0 | 12.000000 | 79.000000 | 342.666667 | 371.333333 | 7.366667 | 2.770000 | 7.333333 | 12.333333 | 3.200000 | ... | 621.0 | 29.0 | 1.0 | 6500000.0 | 6500000.0 | 2000000.0 | 6500000.0 | 6500000.0 | 2022.0 | 2022.0 |
| Trevor Lawrence | 77632.0 | 18.000000 | 74.350000 | 585.500000 | 636.000000 | 8.050000 | 2.670000 | 13.000000 | 28.500000 | 4.200000 | ... | 182.0 | 21.0 | 4.0 | 36793488.0 | 9198372.0 | 24118900.0 | 27598900.0 | 36793488.0 | 2021.0 | 2025.0 |
| Tua Tagovailoa | 60326.0 | 12.000000 | 74.566667 | 341.000000 | 359.333333 | 8.500000 | 2.523333 | 2.666667 | 11.666667 | 3.000000 | ... | 221.0 | 22.0 | 4.0 | 30275438.0 | 7568860.0 | 19578500.0 | 30275438.0 | 30275438.0 | 2020.0 | 2024.0 |
| Tyler Huntley | 46448.0 | 6.000000 | 73.350000 | 152.000000 | 164.500000 | 7.750000 | 2.905000 | 4.000000 | 3.000000 | 1.750000 | ... | 1341.0 | 25.0 | 1.0 | 2627000.0 | 2627000.0 | 0.0 | 0.0 | 0.0 | 2023.0 | 2023.0 |
| Tyrod Taylor | 6332.0 | 11.000000 | 70.950000 | 280.000000 | 303.500000 | 8.150000 | 2.990000 | 8.500000 | 10.000000 | 2.900000 | ... | 480.0 | 29.0 | 2.0 | 11000000.0 | 5500000.0 | 4200000.0 | 8170000.0 | 8170000.0 | 2022.0 | 2023.0 |
| Zach Wilson | 82096.0 | 11.000000 | 67.800000 | 279.500000 | 312.000000 | 8.850000 | 3.060000 | 8.000000 | 9.000000 | 2.850000 | ... | 190.0 | 21.0 | 4.0 | 35150681.0 | 8787670.0 | 22924132.0 | 35150681.0 | 35150681.0 | 2021.0 | 2025.0 |
56 rows × 50 columns
Now we want to see the relationship between a player's average yards and their salary.
plt.cla()
m, b = np.polyfit(grouped_qbs['yards'],grouped_qbs['AAV'] , 1)
plt.scatter(grouped_qbs['yards'], grouped_qbs['AAV'],)
plt.plot(grouped_qbs['yards'], grouped_qbs['yards']*m + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('Player average yards vs AAV')
Text(0.5, 1.0, 'Player average yards vs AAV')
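The fitted red line shows the trend, but the strength of that relationship is easier to judge with a correlation coefficient alongside the slope. A minimal sketch of that pattern on synthetic yards/AAV values (the numbers below are illustrative, not from our dataset):

```python
import numpy as np

# synthetic example: average passing yards and AAV (millions) for five players
yards = np.array([2500.0, 3200.0, 3900.0, 4300.0, 4800.0])
aav = np.array([4.0, 8.0, 25.0, 35.0, 45.0])

# slope/intercept of the least-squares line, as in the plot above
m, b = np.polyfit(yards, aav, 1)

# Pearson correlation coefficient quantifies how tight the fit is
r = np.corrcoef(yards, aav)[0, 1]
print(f"slope={m:.4f}, intercept={b:.2f}, r={r:.3f}")
```

An `r` close to 1 means the line explains the spread well; values near 0 mean the line is nearly uninformative, which is the distinction we will lean on when comparing clusters below.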
Now let's analyze our clusters. Let's start with the less than 20 million cluster.
plt.cla()
# lower salary cluster
m, b = np.polyfit(cluster_1_df['yards'], cluster_1_df['AAV'], 1)
plt.scatter(cluster_1_df['yards'], cluster_1_df['AAV'])
plt.plot(cluster_1_df['yards'], cluster_1_df['yards']*m + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('lower cluster yards VS AAV')
Text(0.5, 1.0, 'lower cluster yards VS AAV')
More than 20 million cluster
plt.cla()
# upper salary cluster
m, b = np.polyfit( cluster_0_df['yards'], cluster_0_df['AAV'], 1)
plt.scatter(cluster_0_df['yards'],cluster_0_df['AAV'],)
plt.plot(cluster_0_df['yards'], cluster_0_df['yards']*m + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('upper cluster yards VS AAV')
Text(0.5, 1.0, 'upper cluster yards VS AAV')
Now let's look at the relationship between yards and AAV using only seasons from before each player received their contract.
# yards and average salary relationship
plt.cla()
pre_contract = passing_df[passing_df['year'] < passing_df['start_year']]
m, b = np.polyfit(pre_contract['yards'], pre_contract['AAV'], 1)
plt.scatter(pre_contract['yards'], pre_contract['AAV'])
plt.plot(pre_contract['yards'], m*pre_contract['yards'] + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('Yards before contract year VS AAV')
Text(0.5, 1.0, 'Yards before contract year VS AAV')
Now let's look at how players perform after they receive their contracts.
# yards and average salary relationship
plt.cla()
post_contract = passing_df[passing_df['year'] >= passing_df['start_year']]
m, b = np.polyfit(post_contract['yards'], post_contract['AAV'], 1)
plt.scatter(post_contract['yards'], post_contract['AAV'])
plt.plot(post_contract['yards'], m*post_contract['yards'] + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('yards after contract year VS AAV')
Text(0.5, 1.0, 'yards after contract year VS AAV')
Now let's explore the upper cluster, the quarterbacks who have received a notable second contract. Understanding these players will help us use our data to figure out how to better predict what contracts should look like.
# yards before contract and average salary relationship
plt.cla()
upper_pre = cluster_0_df[cluster_0_df['year'] < cluster_0_df['start_year']]
m, b = np.polyfit(upper_pre['yards'], upper_pre['AAV'], 1)
plt.scatter(upper_pre['yards'], upper_pre['AAV'])
plt.plot(upper_pre['yards'], m*upper_pre['yards'] + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('upper cluster yards before contract VS AAV')
Text(0.5, 1.0, 'upper cluster yards before contract VS AAV')
# yards after contract and average salary relationship
plt.cla()
upper_post = cluster_0_df[cluster_0_df['year'] >= cluster_0_df['start_year']]
m, b = np.polyfit(upper_post['yards'], upper_post['AAV'], 1)
plt.scatter(upper_post['yards'], upper_post['AAV'])
plt.plot(upper_post['yards'], m*upper_post['yards'] + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('upper cluster after contract VS AAV')
Text(0.5, 1.0, 'upper cluster after contract VS AAV')
# pff grade and average salary relationship
plt.cla()
m, b = np.polyfit(cluster_0_df['grades_offense'], cluster_0_df["AAV"],1)
plt.scatter(cluster_0_df['grades_offense'], cluster_0_df["AAV"],)
plt.plot(cluster_0_df['grades_offense'], m*cluster_0_df['grades_offense'] + b, color='red')
plt.xlabel('PFF Grade')
plt.ylabel('AAV')
plt.title('upper cluster PFF grade VS AAV')
Text(0.5, 1.0, 'upper cluster PFF grade VS AAV')
# big time throws and average salary relationship
plt.cla()
m, b = np.polyfit(cluster_0_df['big_time_throws'], cluster_0_df["AAV"],1)
plt.scatter(cluster_0_df['big_time_throws'], cluster_0_df["AAV"],)
plt.plot(cluster_0_df['big_time_throws'], m*cluster_0_df['big_time_throws'] + b, color='red')
plt.xlabel('big time throws')
plt.ylabel('AAV')
plt.title('upper cluster big time throws VS AAV')
Text(0.5, 1.0, 'upper cluster big time throws VS AAV')
There appears to be a high correlation between salary and yards for players who are paid less. However, for highly paid players the correlation is negligible, meaning yards alone is neither a strong relationship nor a good predictor of future salary for that group. We do see slightly more promise with other features such as big time throws and PFF grade. We will now explore feature selection to see whether we can pick a set of features that predict salary well together. We want to identify features that are redundant with each other or irrelevant to predicting salary. Understanding our data in this way allows for cleaner models that converge quicker, generalize better, and produce accurate results more efficiently. We will start by seeing how each feature correlates with AAV and dropping the features where there is essentially none.
targets = ['Value', 'AAV', 'Sign Bonus', "G'teed @ Sign", "Practical G'teed"]
# features holds the numeric stat columns (selected earlier via select_dtypes)
aav_corr = passing_df[list(features.columns) + ['AAV']].corr()['AAV']
abs(aav_corr).sort_values()
grades_hands_fumble      0.000380
declined_penalties       0.011715
sack_percent             0.060447
hit_as_threw             0.071293
pressure_to_sack_rate    0.084187
interceptions            0.087623
avg_depth_of_target      0.101817
bats                     0.127425
drop_rate                0.133226
spikes                   0.202030
turnover_worthy_plays    0.223162
grades_run               0.230031
avg_time_to_throw        0.230634
twp_rate                 0.236639
penalties                0.244877
completion_percent       0.288256
accuracy_percent         0.305657
drops                    0.308714
sacks                    0.353610
scrambles                0.365247
btt_rate                 0.381945
ypa                      0.400228
thrown_aways             0.403218
aimed_passes             0.408570
def_gen_pressures        0.424604
dropbacks                0.425085
completions              0.427110
player_game_count        0.432702
first_downs              0.457631
yards                    0.458107
grades_pass              0.459533
big_time_throws          0.469766
grades_offense           0.482950
qb_rating                0.498848
touchdowns               0.503470
AAV                      1.000000
Name: AAV, dtype: float64
The cutoff we will decide on is spikes, which based on football intuition has very little to do with analyzing a player's performance. Every feature with less correlation to AAV than that will also be dropped.
aav_corr = aav_corr[aav_corr > 0.21]
features = features[aav_corr.drop('AAV').index]
aav_corr
player_game_count        0.432702
accuracy_percent         0.305657
aimed_passes             0.408570
avg_time_to_throw        0.230634
big_time_throws          0.469766
btt_rate                 0.381945
completion_percent       0.288256
completions              0.427110
def_gen_pressures        0.424604
dropbacks                0.425085
drops                    0.308714
first_downs              0.457631
grades_offense           0.482950
grades_pass              0.459533
grades_run               0.230031
penalties                0.244877
qb_rating                0.498848
sacks                    0.353610
scrambles                0.365247
thrown_aways             0.403218
touchdowns               0.503470
turnover_worthy_plays    0.223162
yards                    0.458107
ypa                      0.400228
AAV                      1.000000
Name: AAV, dtype: float64
Now we have reduced our feature set a little. Next we want to remove features that are redundant with each other. To do this we will use the pandas scatter matrix plotting function, which plots every feature against every other feature, to visualize the relationships intuitively. With this we can spot pairs of features that are highly correlated and therefore redundant.
pd.plotting.scatter_matrix(features[aav_corr.drop('AAV').index], diagonal='kde', figsize=(30, 30))
for ax in plt.gcf().axes:
    ax.tick_params(labelsize=12)
    ax.xaxis.label.set_fontsize(7)
    ax.yaxis.label.set_fontsize(7)
plt.show()
From this graph we can identify a few features to drop. First, aimed passes is highly redundant, so we will drop it. Several other features are similarly intercorrelated: completions, dropbacks, first downs, big time throws, and yards. Because big time throws is the most correlated with AAV, we will keep it and drop the others. The offense and passing grades are also highly redundant, since the passing grade is a component of the offensive grade, so we will drop the passing grade. There is some correlation in other areas, but we will leave those for now. This leaves us with 18 features.
features = features.drop(['aimed_passes', 'completions', 'dropbacks', 'first_downs', 'grades_pass', 'yards'], axis=1)
corr_matrix = features.corr()
corr_matrix
| player_game_count | accuracy_percent | avg_time_to_throw | big_time_throws | btt_rate | completion_percent | def_gen_pressures | drops | grades_offense | grades_run | penalties | qb_rating | sacks | scrambles | thrown_aways | touchdowns | turnover_worthy_plays | ypa | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| player_game_count | 1.000000 | 0.437641 | 0.099989 | 0.772411 | 0.351008 | 0.371896 | 0.888488 | 0.794328 | 0.621176 | 0.282298 | 0.523688 | 0.482449 | 0.744253 | 0.577548 | 0.742027 | 0.816649 | 0.706002 | 0.356817 |
| accuracy_percent | 0.437641 | 1.000000 | -0.239002 | 0.357308 | 0.071976 | 0.847651 | 0.361914 | 0.380525 | 0.589884 | 0.134533 | 0.184160 | 0.672444 | 0.282167 | 0.122394 | 0.341557 | 0.493866 | 0.126579 | 0.432872 |
| avg_time_to_throw | 0.099989 | -0.239002 | 1.000000 | 0.077914 | 0.174100 | -0.307968 | 0.186844 | -0.024387 | 0.056687 | 0.331428 | 0.135266 | -0.031181 | 0.175699 | 0.583485 | 0.202739 | 0.030019 | 0.020522 | 0.008194 |
| big_time_throws | 0.772411 | 0.357308 | 0.077914 | 1.000000 | 0.787101 | 0.350747 | 0.781584 | 0.686443 | 0.745718 | 0.290445 | 0.400217 | 0.569391 | 0.569285 | 0.522322 | 0.668408 | 0.877429 | 0.589140 | 0.455350 |
| btt_rate | 0.351008 | 0.071976 | 0.174100 | 0.787101 | 1.000000 | 0.146743 | 0.385275 | 0.281415 | 0.590632 | 0.235507 | 0.218078 | 0.477410 | 0.280548 | 0.335089 | 0.327626 | 0.561176 | 0.250768 | 0.455223 |
| completion_percent | 0.371896 | 0.847651 | -0.307968 | 0.350747 | 0.146743 | 1.000000 | 0.284873 | 0.164711 | 0.555226 | 0.075198 | 0.143891 | 0.727730 | 0.242659 | 0.092600 | 0.140309 | 0.475021 | 0.111345 | 0.507245 |
| def_gen_pressures | 0.888488 | 0.361914 | 0.186844 | 0.781584 | 0.385275 | 0.284873 | 1.000000 | 0.770062 | 0.565647 | 0.297715 | 0.439443 | 0.396939 | 0.799194 | 0.668934 | 0.751959 | 0.781361 | 0.735876 | 0.278926 |
| drops | 0.794328 | 0.380525 | -0.024387 | 0.686443 | 0.281415 | 0.164711 | 0.770062 | 1.000000 | 0.513224 | 0.222321 | 0.348474 | 0.266030 | 0.563719 | 0.448961 | 0.647000 | 0.688533 | 0.661611 | 0.156339 |
| grades_offense | 0.621176 | 0.589884 | 0.056687 | 0.745718 | 0.590632 | 0.555226 | 0.565647 | 0.513224 | 1.000000 | 0.441164 | 0.275276 | 0.740568 | 0.392531 | 0.436743 | 0.475274 | 0.752588 | 0.236734 | 0.607530 |
| grades_run | 0.282298 | 0.134533 | 0.331428 | 0.290445 | 0.235507 | 0.075198 | 0.297715 | 0.222321 | 0.441164 | 1.000000 | 0.142861 | 0.215324 | 0.239035 | 0.528501 | 0.206591 | 0.305779 | 0.182653 | 0.168180 |
| penalties | 0.523688 | 0.184160 | 0.135266 | 0.400217 | 0.218078 | 0.143891 | 0.439443 | 0.348474 | 0.275276 | 0.142861 | 1.000000 | 0.180349 | 0.443809 | 0.311807 | 0.494594 | 0.387091 | 0.376580 | 0.116272 |
| qb_rating | 0.482449 | 0.672444 | -0.031181 | 0.569391 | 0.477410 | 0.727730 | 0.396939 | 0.266030 | 0.740568 | 0.215324 | 0.180349 | 1.000000 | 0.275732 | 0.254520 | 0.323622 | 0.732861 | 0.167743 | 0.795439 |
| sacks | 0.744253 | 0.282167 | 0.175699 | 0.569285 | 0.280548 | 0.242659 | 0.799194 | 0.563719 | 0.392531 | 0.239035 | 0.443809 | 0.275732 | 1.000000 | 0.571604 | 0.547841 | 0.530947 | 0.575722 | 0.227301 |
| scrambles | 0.577548 | 0.122394 | 0.583485 | 0.522322 | 0.335089 | 0.092600 | 0.668934 | 0.448961 | 0.436743 | 0.528501 | 0.311807 | 0.254520 | 0.571604 | 1.000000 | 0.502622 | 0.507499 | 0.449490 | 0.174856 |
| thrown_aways | 0.742027 | 0.341557 | 0.202739 | 0.668408 | 0.327626 | 0.140309 | 0.751959 | 0.647000 | 0.475274 | 0.206591 | 0.494594 | 0.323622 | 0.547841 | 0.502622 | 1.000000 | 0.665953 | 0.529567 | 0.175107 |
| touchdowns | 0.816649 | 0.493866 | 0.030019 | 0.877429 | 0.561176 | 0.475021 | 0.781361 | 0.688533 | 0.752588 | 0.305779 | 0.387091 | 0.732861 | 0.530947 | 0.507499 | 0.665953 | 1.000000 | 0.607125 | 0.555451 |
| turnover_worthy_plays | 0.706002 | 0.126579 | 0.020522 | 0.589140 | 0.250768 | 0.111345 | 0.735876 | 0.661611 | 0.236734 | 0.182653 | 0.376580 | 0.167743 | 0.575722 | 0.449490 | 0.529567 | 0.607125 | 1.000000 | 0.192064 |
| ypa | 0.356817 | 0.432872 | 0.008194 | 0.455350 | 0.455223 | 0.507245 | 0.278926 | 0.156339 | 0.607530 | 0.168180 | 0.116272 | 0.795439 | 0.227301 | 0.174856 | 0.175107 | 0.555451 | 0.192064 | 1.000000 |
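The redundant pairs above were picked out by eye from the scatter matrix. The same check can be done programmatically by scanning the correlation matrix for pairs above a threshold; a sketch on a small synthetic frame (the column names and numbers here are illustrative, not our data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
yards = rng.normal(3500, 500, n)
df = pd.DataFrame({
    "yards": yards,
    # completions track yards almost exactly -> redundant
    "completions": yards / 11 + rng.normal(0, 5, n),
    # sacks here are independent noise
    "sacks": rng.normal(30, 8, n),
})

def redundant_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

print(redundant_pairs(df))
```

From each flagged pair we would keep whichever feature correlates more strongly with AAV, mirroring the manual choice made above.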
features.columns
Index(['player_game_count', 'accuracy_percent', 'avg_time_to_throw',
'big_time_throws', 'btt_rate', 'completion_percent',
'def_gen_pressures', 'drops', 'grades_offense', 'grades_run',
'penalties', 'qb_rating', 'sacks', 'scrambles', 'thrown_aways',
'touchdowns', 'turnover_worthy_plays', 'ypa'],
dtype='object')
pd.plotting.scatter_matrix(features[corr_matrix.index], diagonal='kde', figsize=(30, 30))
for ax in plt.gcf().axes:
    ax.tick_params(labelsize=12)
    ax.xaxis.label.set_fontsize(7)
    ax.yaxis.label.set_fontsize(7)
plt.show()
Nothing seems blatantly correlated. Now we want to test how many features are actually valuable for our dataset. For this we will do a sequential feature selection and test how the number of features affects a linear regression model. The SequentialFeatureSelector from sklearn is a greedy algorithm that repeatedly adds the next best feature for the model, up to the specified amount. We loop over every possible number of features to see whether performance converges early, in which case we could remove some features.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Create a LinearRegression model
lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(features, passing_df['AAV'], test_size=0.2, random_state=42)
for i in range(1, len(features.columns)):
    # Create a SequentialFeatureSelector
    sfs = SequentialFeatureSelector(lr, n_features_to_select=i, direction='forward')
    # Fit the SequentialFeatureSelector on the data
    sfs.fit(X_train, y_train)
    # Get the selected feature indices
    selected_features = sfs.get_support(indices=True)
    selected_feature_names = list(features.columns[selected_features])
    lr.fit(X_train[selected_feature_names], y_train)
    r_squared = lr.score(X_test[selected_feature_names], y_test)
    print(f'{i} features, R-squared: {r_squared}')
1 features, R-squared: 0.244606316577741
2 features, R-squared: 0.09224184085127551
3 features, R-squared: 0.1254399814209034
4 features, R-squared: 0.12996871273586075
5 features, R-squared: 0.1220496126531656
6 features, R-squared: 0.1192598219799419
7 features, R-squared: 0.1291698097568209
8 features, R-squared: 0.11305164951905311
9 features, R-squared: 0.11170548528584112
10 features, R-squared: 0.08477864360019127
11 features, R-squared: 0.08356644818467918
12 features, R-squared: 0.08352904118312476
13 features, R-squared: 0.07484603528129596
14 features, R-squared: 0.05595573976768642
15 features, R-squared: 0.05740324208111103
16 features, R-squared: 0.03812686600087556
17 features, R-squared: 0.03288428094003615
Counterintuitively, the test R-squared is highest with a single feature and generally declines as more are added, which suggests the plain linear model overfits on this small dataset. Since the neural network we build next can make use of richer inputs, we will still move forward with the top 16 features.
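One caveat with the loop above: each R-squared comes from a single 80/20 split, so the numbers are noisy. A more stable criterion is cross-validated scoring per candidate feature count; a sketch of that pattern on synthetic data (not our passing data):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 6))
# target depends on the first three columns plus noise
y = X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 0.1, 150)

scores = {}
for k in range(1, X.shape[1]):
    sfs = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=k, direction="forward"
    )
    X_sel = sfs.fit_transform(X, y)
    # mean R^2 over 5 folds instead of one train/test split
    scores[k] = cross_val_score(LinearRegression(), X_sel, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the cross-validated score climbs until the three informative columns are in and then plateaus, which is the "convergence" we were looking for above.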
sfs = SequentialFeatureSelector(lr, n_features_to_select=16, direction='forward')
sfs.fit(X_train, y_train)
# Get the selected feature indices
selected_features = sfs.get_support(indices=True)
selected_feature_names = list(features.columns[selected_features]) + ['start_year']
features = passing_df[selected_feature_names]
features.columns
Index(['player_game_count', 'accuracy_percent', 'avg_time_to_throw',
'big_time_throws', 'btt_rate', 'completion_percent', 'drops',
'grades_offense', 'grades_run', 'penalties', 'qb_rating', 'sacks',
'scrambles', 'thrown_aways', 'turnover_worthy_plays', 'ypa',
'start_year'],
dtype='object')
Now we will do some machine learning to create a model that can predict salary, using observations from our exploratory analysis to experiment with multiple models. For the most part, we will use neural networks, built with TensorFlow, a Python library for constructing and training them. It lets us specify how many hidden layers and nodes we have, while finer details like the weight adjustments that determine how we get from input to output are handled on the back end. Our model will have a single output, since our targets are already closely correlated with each other. We will use the ReLU activation function, which zeroes out negative values, and we will normalize all our data. We will train on 80% of our data and hold out 20% for testing, with 2 hidden layers.
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from sklearn.metrics import r2_score
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, passing_df[['AAV']], test_size=0.2, random_state=42)
# normalize; note we reuse one StandardScaler: after the refit on the target
# below, scaler.inverse_transform recovers AAV (not feature) values
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
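Reusing a single scaler for both the features and the target works here only because the features were already transformed before the refit on `y`. A safer pattern keeps a separate scaler per variable so each can invert its own data at any point; a minimal sketch on synthetic stand-ins for our yards and AAV columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(3500, 500, size=(40, 1))   # e.g. passing yards
y_train = rng.normal(10e6, 5e6, size=(40, 1))   # e.g. AAV in dollars

x_scaler = StandardScaler()
y_scaler = StandardScaler()
X_scaled = x_scaler.fit_transform(X_train)
y_scaled = y_scaler.fit_transform(y_train)

# each scaler can invert its own variable without clobbering the other
X_back = x_scaler.inverse_transform(X_scaled)
y_back = y_scaler.inverse_transform(y_scaled)
print(np.allclose(X_back, X_train), np.allclose(y_back, y_train))
```

With two scalers, predictions can be mapped back to dollar amounts at any time via `y_scaler.inverse_transform`, regardless of when the features were scaled.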
model = tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(31, activation='relu'),
tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score:', r2)
R^2 score: -0.27057328396322666
plt.cla()
plt.scatter(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred))
plt.title('QB NN')
plt.xlabel('test data')
plt.ylabel('predicted data')
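A negative R² like the one above means the model did worse on the held-out data than always predicting the mean salary. A tiny sketch with toy numbers (not our data) shows how r2_score behaves:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# always predicting the mean gives exactly R^2 = 0
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))  # 0.0

# predictions worse than the mean push R^2 below zero (here ≈ -1.4)
print(r2_score(y_true, np.array([4.0, 1.0, 2.0, 3.0])))
```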
Let's now train only on seasons played before each player's contract started.
features = passing_df[(passing_df['start_year'] > passing_df['year']) & (passing_df['dropbacks'] > 300)][selected_feature_names]
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, passing_df[(passing_df['start_year'] > passing_df['year']) & (passing_df['dropbacks'] > 300)][['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
model = tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(31, activation='relu'),
tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score:', r2)
R^2 score: -0.1760114269782802
plt.cla()
plt.scatter(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred))
plt.title('QB NN before contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
features = cluster_1_df[cluster_1_df['dropbacks'] > 300][selected_feature_names]
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, cluster_1_df[cluster_1_df['dropbacks'] > 300][['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
model = tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(31, activation='relu'),
tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score:', r2)
R^2 score: 0.6845601843548529
plt.cla()
plt.scatter(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred))
plt.title('Upper cluster QB NN')
plt.xlabel('test data')
plt.ylabel('predicted data')
Now let's train on our upper cluster before they receive their contracts.
features = cluster_1_df[(cluster_1_df['start_year'] > cluster_1_df['year']) & (cluster_1_df['dropbacks'] > 300)][selected_feature_names]
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, cluster_1_df[(cluster_1_df['start_year'] > cluster_1_df['year']) & (cluster_1_df['dropbacks'] > 300)][['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
model = tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(31, activation='relu'),
tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score (test):', r2)
R^2 score (test): 0.6683308312916421
nn = model
plt.cla()
plt.scatter(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred))
plt.title('Upper cluster QB NN before contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
Let's train on quarterbacks after they receive their contracts.
features = passing_df[(passing_df['start_year'] <= passing_df['year']) & (passing_df['dropbacks'] > 300)][selected_feature_names]
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, passing_df[(passing_df['start_year'] <= passing_df['year']) & (passing_df['dropbacks'] > 300)][['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
model = tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(31, activation='relu'),
tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score:', r2)
R^2 score: 0.21536589007912033
plt.plot(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred), 'o')
plt.title('QB NN after contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
plt.show()
features = cluster_1_df[cluster_1_df['start_year'] <= cluster_1_df['year']][selected_feature_names]
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, cluster_1_df[cluster_1_df['start_year'] <= cluster_1_df['year']][['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
model = tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(31, activation='relu'),
tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score:', r2)
R^2 score: -0.17128032098927815
plt.scatter(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred))
plt.title('QB NN after contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
plt.show()
There simply isn't enough data here to draw any conclusions. From our neural networks, we found that both the upper cluster and pre-contract seasons make for better training data. Our most accurate model trained on the cluster of quarterbacks who received second contracts, using only how they played before those contracts; it predicted salary fairly well on the held-out test data. Now let's try a couple of other machine learning models, starting with K-nearest neighbors. This model takes a new data point, finds the known points closest to it, and predicts based on their values. scikit-learn's KNeighborsRegressor lets us do this while specifying the number of neighbors to consider.
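To make the mechanism concrete, here is a minimal from-scratch sketch of KNN regression on toy data (KNeighborsRegressor does the same thing with efficient neighbor search):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest neighbors
    return y_train[nearest].mean()                    # average their known targets

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([10.0, 20.0, 30.0, 100.0])
# the two nearest neighbors of x=2.5 are x=2 and x=3, so we average 20 and 30
print(knn_predict(X, y, np.array([2.5]), k=2))  # 25.0
```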
from sklearn.neighbors import KNeighborsRegressor
features = passing_df[selected_feature_names]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, passing_df[['AAV']], test_size=0.2, random_state=42)
# Create KNN regressor
knn = KNeighborsRegressor(n_neighbors=20)
# Fit the model on the training data
knn.fit(X_train, y_train)
# Predict on the test data
y_pred = knn.predict(X_test)
# Evaluate the model
print("R^2 Score:", r2_score(y_test, y_pred))
R^2 Score: 0.2716735737097594
features = cluster_1_df[(cluster_1_df['start_year'] > cluster_1_df['year']) & (cluster_1_df['dropbacks'] > 300)][selected_feature_names]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, cluster_1_df[(cluster_1_df['start_year'] > cluster_1_df['year']) & (cluster_1_df['dropbacks'] > 300)][['AAV']], test_size=0.2, random_state=42)
# Create KNN regressor
knn = KNeighborsRegressor(n_neighbors=5)
# Fit the model on the training data
knn.fit(X_train, y_train)
# Predict on the test data
y_pred = knn.predict(X_test)
# Evaluate the model
print("R^2 Score:", r2_score(y_test, y_pred))
R^2 Score: 0.687334875311706
plt.plot(y_test, y_pred, 'o')
plt.title('QB KNN before contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
plt.show()
This is comparable to our neural network and still a good model for what we're looking for. Now let's try linear regression.
features = passing_df[selected_feature_names]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, passing_df['AAV'], test_size=0.2, random_state=42)
# Create a linear regression model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# Fit the model to the training data
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print("R^2 Score:", r2_score(y_test, y_pred))
R^2 Score: 0.00616578919961297
features = cluster_1_df[cluster_1_df['start_year'] > cluster_1_df['year']][selected_feature_names]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, cluster_1_df[cluster_1_df['start_year'] > cluster_1_df['year']][['AAV']], test_size=0.2, random_state=42)
# Create a linear regression model
lr = LinearRegression()
# Fit the model to the training data
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print("R^2 Score:", r2_score(y_test, y_pred))
R^2 Score: 0.15657887317565966
plt.plot(y_test, y_pred, 'o')
plt.title('QB linear regression before contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
plt.show()
This is clearly not as good as our KNN or neural network. Now let's use our models to predict the average salary of the next two quarterbacks anticipated to get massive contracts, Joe Burrow and Justin Herbert.
justin_herbert = passing_df[passing_df['player'] == 'Justin Herbert']
justin_herbert
| | player | player_id | position | team_name | player_game_count | accuracy_percent | aimed_passes | attempts | avg_depth_of_target | avg_time_to_throw | ... | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Justin Herbert | 28237 | QB | LAC | 18 | 78.8 | 673 | 743 | 6.9 | 2.74 | ... | 22 | 4 | 26578755 | 6644689 | 16890004 | 26578755 | 26578755 | 2020 | 2024 | 0 |
| 6 | Justin Herbert | 28237 | QB | LAC | 17 | 77.1 | 624 | 672 | 7.9 | 2.72 | ... | 22 | 4 | 26578755 | 6644689 | 16890004 | 26578755 | 26578755 | 2020 | 2024 | 0 |
| 7 | Justin Herbert | 28237 | QB | LAC | 15 | 75.8 | 553 | 595 | 7.8 | 2.61 | ... | 22 | 4 | 26578755 | 6644689 | 16890004 | 26578755 | 26578755 | 2020 | 2024 | 0 |
3 rows × 54 columns
joe_burrow = passing_df[passing_df['player'] == 'Joe Burrow']
joe_burrow
| | player | player_id | position | team_name | player_game_count | accuracy_percent | aimed_passes | attempts | avg_depth_of_target | avg_time_to_throw | ... | Signed Age | Yrs | Value | AAV | Sign Bonus | G'teed @ Sign | Practical G'teed | start_year | end_year | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | Joe Burrow | 28022 | QB | CIN | 19 | 77.3 | 660 | 715 | 7.6 | 2.50 | ... | 23 | 4 | 36190137 | 9047534 | 23880100 | 36190137 | 36190137 | 2020 | 2024 | 0 |
| 9 | Joe Burrow | 28022 | QB | CIN | 20 | 79.8 | 619 | 662 | 8.2 | 2.62 | ... | 23 | 4 | 36190137 | 9047534 | 23880100 | 36190137 | 36190137 | 2020 | 2024 | 0 |
| 10 | Joe Burrow | 28022 | QB | CIN | 10 | 73.5 | 381 | 404 | 8.9 | 2.60 | ... | 23 | 4 | 36190137 | 9047534 | 23880100 | 36190137 | 36190137 | 2020 | 2024 | 0 |
3 rows × 54 columns
print("Justin Herbert estimated contract:", max(knn.predict(justin_herbert[selected_feature_names]))[0])
Justin Herbert estimated contract: 45401133.4
print("Joe Burrow estimated contract:", max(knn.predict(joe_burrow[selected_feature_names]))[0])
Joe Burrow estimated contract: 41358996.2
Based on real-world intuition, these predictions are a bit low, possibly because we could not account for inflation without shrinking the dataset too much. After exploring and analyzing the NFL player stats data, we made several interesting findings. Certain features, like yards, completions and dropbacks, were highly correlated with one another and not needed in a predictive model. We were also able to use machine learning techniques such as neural networks and K-nearest neighbors to predict player salaries from past performance. This project walks through the data science lifecycle: we started with data collection and processing, moved on to exploration to better understand the data and identify issues or missing values, then selected and transformed relevant features to improve model performance, and finally built models with several machine learning algorithms to predict salaries. Overall, this project demonstrates the value of exploratory data analysis and machine learning for extracting insights from complex data. It also highlights the iterative nature of the data science lifecycle, where each step informs and improves the next, ultimately leading to a better understanding of the data and better models. Finally, it supports our hypothesis that, with the right subset of data, player contracts can be predicted with a reasonable degree of confidence.
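One way to approximate the inflation adjustment mentioned above would be to scale each contract's AAV by the growth of the league salary cap between the signing year and a reference year. A minimal sketch of the idea (the cap figures below are approximate placeholders and would need verification before real use):

```python
# Approximate NFL salary caps in dollars -- illustrative figures only, not verified data
salary_cap = {2018: 177.2e6, 2020: 198.2e6, 2023: 224.8e6}

def cap_adjusted_aav(aav, sign_year, target_year=2023):
    """Express a contract's AAV in target-year dollars via salary-cap growth."""
    return aav * salary_cap[target_year] / salary_cap[sign_year]

# e.g. a $26.6M AAV signed in 2020 scales to roughly $30.2M under the assumed 2023 cap
print(round(cap_adjusted_aav(26.6e6, 2020) / 1e6, 1))
```

Applying this transformation to AAV before training would put contracts signed in different years on a comparable scale without dropping older rows from the dataset.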